1. INTRODUCTION


1.1 SIGNAL PROCESSING PAST AND PRESENT

Classical signal processing techniques used analog technology and were
characterized by static realizations of low-pass, band-pass and high-pass
filters based on knowledge of the signal and noise spectra. The end result of
such analog computation is prone to error due to noise and to the drift, with
temperature and aging, of the resistors, capacitors, etc. used in the design of
these filters. Limited dynamic range and finite signal-to-noise ratio restricted
signal processing with analog technology. Further, for many applications this
approach was not feasible because it was too difficult to optimize the
performance of a system with so many design parameters. This led to the use of
digital computers in signal processing applications. In the earlier stages the
digital computer was not used for real-time signal processing. The primary
operations involved were filtering, convolution, correlation, FFT, etc. The
computational complexity was O(N log2 N) for the FFT and O(N^2) for the other
operations. Theoretical advances in signal processing brought in newer concepts
like adaptive filters, Kalman filters and the linear predictive coding model for
speech. These concepts involve operations like matrix-vector multiplication,
matrix-matrix multiplication, LU decomposition, linear system solution, singular
value decomposition and Hermitian eigensystem solution. The computational
complexity of these operations is O(N^3), which places great demands on
effective computational means for realizing these systems. Fortunately, the
advances in signal processing theory were concomitant with rapid advances in
integrated circuit technology.



In the early days of digital computers, hardware was expensive and
digital computers had limited capabilities. With advances in hardware
technology, parallelism was introduced into the uniprocessor computer in the
form of multiple functional units. Other aspects of this parallelism are
pipelining within the CPU, overlapping of CPU and I/O operations, hierarchical
memory systems, multiprogramming and time-sharing [14]. As scientific
computation occupied only a small fraction of computer time, general-purpose
computers were not fine-tuned for scientific applications. As science advanced,
so did the scientific community's need for faster and greater computing power.
Greater computational throughput can be achieved by (i) improvement in circuit
performance and (ii) parallel processing.

Improvement in circuit performance is largely driven by the technology
available at any given time. Devices made of silicon are approaching the
theoretical and technological limits for temperature and speed. Even machines
that use Josephson devices or logic devices made of gallium arsenide improve on
the performance of silicon by only one to two orders of magnitude. Indeed, even
if device performance were improved beyond present limits, a substantial
improvement in the overall speed of the computer would not be guaranteed.

The ever-increasing demands for performance and real-time signal
processing strongly indicate the need for tremendous computational capability,
in terms of both volume and speed. This is also illustrated by the wide
acceptance of vector processing machines like the Cray-1. The speedup in these
machines has been achieved through the architectural improvements mentioned
before and through the development of compilers that map programs designed for
serial machines onto a vector architecture. Despite the impressive speed of
these computers, the von Neumann approach to their architecture limits their
usefulness for computation-intensive problems. Achieving high performance
depends not only on using faster devices but also calls for a major improvement
in computer architecture. Throughput can be increased by parallel processing, in
which multiple computing elements work concurrently on the solution of a single
problem. Concurrency implies pipelining, simultaneity and parallelism. Such
concurrency is available at various information processing levels, such as the
job or program level, the task or procedure level, the inter-instruction level
and the intra-instruction level. Current parallel processing systems can be
divided into three architectural configurations [14]:

* Pipeline computer 

* Multiprocessor systems 

* Array processor 

A pipeline computer exploits temporal parallelism by overlapping the
execution of different instructions. Multiprocessor systems achieve
asynchronous parallelism through a set of interactive processors. Finally, an
array processor exploits spatial parallelism using multiple synchronized
arithmetic logic units.

1.2 PIPELINE COMPUTER

A pipeline computer is a SIMD processor which works according to the
principle of pipelining. Pipelining implies partitioning instructions into
simpler computational steps which can be executed independently by separate
computational units. Because the execution of different instructions overlaps,
pipeline machines are better tuned to perform the same operation repeatedly on
a large set of data. Pipeline computers are therefore well suited to vector
processing and can be found in most computers designed for such applications.
Examples of such machines are the CDC STAR-100, the Texas Instruments ASC, all
Cray machines, the Cyber-205, the Fujitsu VP-200, the attached pipeline
processors AP-120B and FPS-164 by Floating Point Systems, and the IBM 3838.


The limitations of pipelining techniques are -

i) When the vectors are short, the speedup is small because a
relatively large fraction of the execution time is spent filling and draining
the pipeline, which causes a delay before the first results emerge from the
pipeline.

ii) Theoretically, a k-stage linear pipeline processor can be at most k
times faster than its non-pipelined counterpart. The improvement in speed
obtainable from pipelining alone is therefore limited.

iii) It is difficult to implement conditional branches or subroutine
calls that depend on the data flowing through the pipe, or operations among
data items while they are still in the pipe.
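
Limitations i) and ii) can be made concrete with a small calculation. The
sketch below is illustrative only and rests on the usual textbook assumptions,
which are not stated in the text: every stage takes one clock cycle, there are
no stalls, and a non-pipelined unit needs k cycles per operation. The function
name is hypothetical.

```python
def pipeline_speedup(k: int, n: int) -> float:
    """Speedup of an ideal k-stage linear pipeline over a non-pipelined
    unit when processing a vector of n independent operations.

    Assumptions (not from the thesis): one clock cycle per stage, no stalls.
    """
    if k < 1 or n < 1:
        raise ValueError("k and n must be positive")
    serial_cycles = n * k            # each operation takes k cycles, one after another
    pipelined_cycles = k + (n - 1)   # k cycles to fill, then one result per cycle
    return serial_cycles / pipelined_cycles


if __name__ == "__main__":
    for n in (1, 4, 64, 1_000_000):
        print(n, round(pipeline_speedup(5, n), 3))
```

For a short vector the fill time dominates and the speedup stays close to 1;
for a very long vector it approaches, but never exceeds, the number of stages k.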

1.3 MULTIPROCESSOR SYSTEMS

A basic multiprocessor system contains two or more processors, each
having local memory and private devices, and all sharing access to common memory
modules, I/O channels and peripheral devices. The entire system is controlled by
a single integrated operating system providing interaction between processors
and their programs at various levels. Interprocessor communication can be done
by using - i) a time-shared common bus, ii) a crossbar switch network and iii)
multiport memories and a variety of interconnection networks.

1.3.1 Time Shared Common Bus

The time-shared common bus can be a single bus, as in Intel's Multibus II,
Motorola's VME bus, Texas Instruments' NuBus, the proposed IEEE 896 Futurebus
and Digital Equipment's Unibus and Qbus [12]. In a single time-shared bus,
performance is determined by the time a processor spends waiting for the bus,
the length of time for which the bus is available, and the bus capacity (in
bytes per second). Contention for the bus is bound to increase as the number of
processor modules increases, thereby degrading system performance. A bus-based
multiprocessor system is therefore suited to problems with coarse-grain
parallelism, which require a minimal amount of interprocessor communication. One
way to improve the system throughput is to connect the processors in
hierarchical clusters with two levels of buses, such that the processors within
a cluster communicate over one bus and intercluster communication takes place
over the other. Systems based on the hierarchical bus concept include the Cm* of
Carnegie-Mellon University and Encore's Ultramax system.

1.3.2 Crossbar Switch Network

The crossbar switch evolved from attempts to overcome the potential
throughput limitations of systems organized around a time-shared bus. An example
of this type is Carnegie-Mellon University's C.mmp system, in which the memories
and I/O nodes are connected to the processor modules. A crossbar provides the
highest bandwidth and the best performance for an interconnected system, but at
the expense of complexity, size and cost, which are proportional to the square
of the number of interconnected components. Expansion of the system is limited
only by the size of the switch matrix that is feasible.

1.3.3 Multiport Memories

If the control, switching and priority arbitration logic that is
distributed throughout the crossbar switch matrix is moved to the interfaces of
the memory modules, a multiport memory system results. This organization is well
suited to both uniprocessor and multiprocessor systems. A common method of
resolving memory-access conflicts is to assign permanently designated priorities
to each memory port.

It is very difficult to justify economically the use of a crossbar switch
or multiport memories for large multiprocessing systems. The cost of the switch
can be reduced by using a switch with a restricted number of possible
permutations. Such networks are less expensive than a full crossbar or
multiport memories for large multiprocessing systems. Moreover, they are
modular and easy to control and expand.

1.4 ARRAY PROCESSOR

The pipeline computer performs well for operations on long vectors.
Bus-based multiprocessor systems are better suited to algorithms with
coarse-grain parallelism. The array processor is not only architecturally
different but also performs exceptionally well on certain kinds of problems.
Array processors are a cost-effective tool for increasing the speed of highly
compute-bound processing.

The term "array processor" has been attached to a variety of
architectures. Array processors include FPS's AP-120B, which has a single
computational unit, and the ILLIAC IV with its 64 computational units. In this
thesis the term array processor denotes an attached processor (connected to a
host computer system) with multiple ALUs, called Processing Elements (PEs), that
operate in parallel in lock-step fashion, so that the combination of the host
and the array processor provides a much higher number-crunching capability than
the host alone.

To see the motivation for using an array processor, we need only
consider the addition of two N x N matrices. Here the N^2 additions can be
performed completely in parallel, in a single addition time, were N^2 adders
available. Another example is image processing, where identical transformations
have to be carried out on every pixel; this can be done simultaneously for as
many elements as there are processors.
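
The matrix addition example can be sketched as follows. This is an
illustrative model only: the function name is hypothetical, and the inner loop
stands in for what the PEs of a real array processor would do simultaneously in
one lock-step addition time.

```python
def matrix_add_lockstep(a, b, num_pes):
    """Add two equally sized matrices on num_pes lock-step PEs.

    Returns the sum and the number of lock-step addition times used.
    Each "wave" models one addition time in which up to num_pes
    element-wise additions happen at once.
    """
    if num_pes < 1:
        raise ValueError("num_pes must be positive")
    rows, cols = len(a), len(a[0])
    result = [[0] * cols for _ in range(rows)]
    positions = [(i, j) for i in range(rows) for j in range(cols)]
    steps = 0
    for start in range(0, len(positions), num_pes):
        # Conceptually simultaneous: one element per PE in this wave.
        for i, j in positions[start:start + num_pes]:
            result[i][j] = a[i][j] + b[i][j]
        steps += 1
    return result, steps


if __name__ == "__main__":
    a = [[1, 2], [3, 4]]
    b = [[10, 20], [30, 40]]
    print(matrix_add_lockstep(a, b, num_pes=4))  # N^2 PEs: one addition time
    print(matrix_add_lockstep(a, b, num_pes=1))  # one PE: N^2 addition times
```

With as many PEs as matrix elements the whole sum takes a single addition time;
with fewer PEs the number of addition times grows as the ceiling of N^2 divided
by the number of PEs.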

The idea of using a regular array of processors appears to go back to
Unger in 1958. The first machine actually designed with a processor array
architecture was the Solomon computer, along the lines proposed by Slotnick. A
much more powerful machine based on the Solomon idea, the ILLIAC IV, was
produced ten years later. The experience gained from the ILLIAC IV led to the
design of the Burroughs BSP, which also had the feature of instruction
streaming. Other machines based on Unger's original idea are ICL's Distributed
Array Processor and NASA's Massively Parallel Processor.

1.5 VLSI ARRAY PROCESSORS

A very promising solution to the real-time requirements of signal
processing is to use special-purpose array processors and to maximize the
processing concurrency by pipeline processing, parallel processing, or both.
Such highly parallel computing structures consume hardware voraciously. On the
other hand, highly parallel systems have structural properties that make the
application of VLSI look very promising. Moreover, the remarkable advances in
VLSI circuitry have lowered the implementation cost of large array processors
to an acceptable level and have sparked research into the design of algorithms
suitable for direct hardware implementation.

Two popular special-purpose VLSI array architectures are systolic and
wavefront arrays, which derive massive concurrency from pipeline processing,
parallel processing, or both. The concept of systolic architecture was
introduced by H. T. Kung and C. E. Leiserson for the VLSI implementation of some
matrix operations [7]. A systolic system can be defined as [3] - a graph
S = (V, E) of n interconnected PEs, where the vertices represent the PEs and the
directed edges represent the interconnections between the PEs. The PEs operate
synchronously, controlled by a common clock. An additional constraint is
imposed: with the exception of the host, all the PEs in V are to be Moore
machines. (From here onwards the terms PE and cell will be used
interchangeably.)

A systolic system consists of a set of interconnected cells, each capable
of performing some simple operation. Information flows between the cells in a
pipelined fashion, and communication with the outside world occurs only at the
boundary cells. The basic principle of a systolic architecture is shown in
Figure 1.1. By replacing a single PE with an array of PEs or cells, a higher
computation throughput can be achieved without increasing the memory bandwidth.
The main idea of this approach is to ensure that once a data item is brought out
of the memory it is used effectively at each cell it passes while being
"pumped" from cell to cell along the array. This is possible for a wide class of
compute-bound problems in which multiple operations are performed on each data
item in a repetitive manner.
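
The principle can be illustrated with a small cycle-by-cycle simulation of
a linear array computing Y = AX + B. This is a minimal sketch under assumptions
that are not taken from the thesis: there is one cell per row of A, each cell
keeps its own row of A and its own accumulator locally, and the function name is
hypothetical. It is not the cut-and-pile mapping used later for SASP.

```python
def systolic_matvec(a, x, b):
    """Simulate Y = A*X + B on a linear array with one cell per row of A.

    Each element of X is read from memory once, enters the left-most cell,
    and is pumped one cell to the right on every clock cycle. Every cell it
    passes uses it for one multiply-accumulate.

    Returns (y, cycles, memory_reads_of_x).
    """
    n_cells, m = len(a), len(x)
    acc = list(b)                  # cell i starts from element i of B
    pipe = [None] * n_cells        # x value currently held by each cell
    consumed = [0] * n_cells       # how many x values each cell has used
    memory_reads = 0
    cycles = m + n_cells - 1       # last x reaches the last cell on the final cycle
    for t in range(cycles):
        # Pump: every cell hands its x value to its right-hand neighbour.
        for i in range(n_cells - 1, 0, -1):
            pipe[i] = pipe[i - 1]
        if t < m:
            pipe[0] = x[t]         # the only access to memory for this x
            memory_reads += 1
        else:
            pipe[0] = None
        # Compute: all cells work in the same clock cycle.
        for i in range(n_cells):
            if pipe[i] is not None:
                acc[i] += a[i][consumed[i]] * pipe[i]
                consumed[i] += 1
    return acc, cycles, memory_reads


if __name__ == "__main__":
    a = [[1, 2, 3], [4, 5, 6]]
    print(systolic_matvec(a, x=[1, 0, 2], b=[10, 20]))
```

Each element of X is fetched from memory once but takes part in one
multiply-accumulate in every cell, so adding cells raises the computation rate
without raising the rate at which X must be read.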



Figure 1.1 Systolic architecture principle

The basic features of a systolic array are [20] -

i) Modularity and regularity - Systolic arrays are designed for solving
a particular class of problems, and the array can be extended simply by adding
new cells to handle larger problems. The application area can also be widened by
adding new cells of different capabilities to meet the additional functional
requirements of a new problem. The extra constraint of excluding Mealy machines
from the array guarantees that the number of PEs can be increased independently
of the clock period.

ii) Synchrony - A single clock is used throughout the system to control
the states of all the Moore machines, i.e., the PEs, in the system. All the PEs
rhythmically perform either the same or different operations on different data
and pass the data and partial results through the array.

iii) Pipelineability - The systolic array exhibits a linear rate of
pipelineability, i.e., it achieves an O(M) speedup in terms of processing rate,
where M is the number of PEs or cells in the array.

iv) Spatial locality and temporal locality - The array has a locally
communicating interconnection structure, so that there is a high probability
that the data needed in the near future is located close to the data currently
in use, i.e., spatial locality. Temporal locality is evident in a systolic array
because the data keeps moving through the array in a systematic manner and there
is at least one unit-time delay, so that data can be transferred from one cell
to another.

The prominent features of systolic arrays are the PEs and the regular
interconnection between the PEs. These features can be implemented in software,
with general-purpose DSP microprocessors, or using specialized hardware. The
required level of granularity should be the main consideration in choosing among
these options.

For some low-precision digital and image processing applications, it is
advisable to use very simple processing primitives. A good example of such a
commercial VLSI chip is NCR's Geometric Arithmetic Parallel Processor, or GAPP.
Many DSP applications require the PEs to include more complex primitives.
Examples of commercial chips with a large granularity are INMOS's Transputer,
NEC's data flow chip µPD7281, and programmable DSP chips like the ADSP-2100,
TMS320, etc. Some DSP algorithms require specialized hardware like a
floating-point ALU and multiplier, high-speed RAMs, fast coefficient table
addressing, etc. To meet such requirements, dedicated chips called DSP building
blocks can be used.

The concept of systolic architecture has been studied extensively over
the last few years. Some of the systems built to put this concept to use are -

i) The Warp [15] - designed by Carnegie Mellon University and its
industrial partners, is used for signal and image processing tasks, low-level
vision processing and scientific computing. This machine has three main units:
the array unit, which consists of 10 or more identical cells, each with a
throughput of 10 MFLOPS, connected in a linear 1-D array; the interface unit,
which handles the communication between the host and the array; and the host,
which supplies data to the array and executes that part of the program which is
not mapped onto the array unit.

ii) The Systolic/Cellular System [8] - designed at Hughes Research
Laboratories, is used for large classes of linear algebraic and cellular
operations that arise in signal processing applications. It consists of a
16 x 16 mesh-connected array processor, a controller and a dual-ported memory
between the array and the host. Each computational unit in the array is a
custom-built VLSI 32-bit fixed-point dual-bus processor with a bit-slice
structure. The maximum system performance is in the neighborhood of 450 MFLOPS.


iii) Matrix-1 [8] - designed by Saxpy Computer Corp. for matrix
operations like matrix eigenvalue and singular value decompositions, matrix
multiplication, Cholesky decomposition and QR factorization. The system consists
of - i) the system controller, a DEC VAX, ii) the matrix processor, a linear
array of up to 32 pipelined floating-point processors that have systolic and
global interconnections, iii) the system memory, which stores data for the
matrix processor, and iv) the mass storage system, for high-speed data-storage
peripherals. All these blocks are interconnected by the Saxpy interconnect. The
peak performance of Matrix-1 can reach up to 1000 MFLOPS, and computation rates
in excess of 300 MFLOPS have been achieved.

1.6 OBJECTIVES AND ORGANIZATION OF THE THESIS

A Systolic Array Signal Processor (SASP) was defined at the system level
by Nemawarkar [19]. The SASP system consists of a linear array of identical
cells, an interface unit and a host. The objective of this thesis is to design
and test a powerful and flexible cell for use in SASP.

The systolic cell to be designed should have the following features -

i) It should be programmable.

ii) The cell architecture should employ multiple pipelined functional
units.

iii) The cell architecture should be optimized for the variety of
algorithms encountered in signal processing.

iv) The data format(s) used should have a wide dynamic range.

The thesis is arranged into the following chapters.

Following the brief introduction in this chapter to signal processing
techniques, and to systolic architectures in particular, Chapter 2 describes the
architecture of the SASP system.

In Chapter 3 we discuss the architectural considerations that were
taken into account in deciding the cell architecture. Also included in this
chapter is a survey of the various DSP building blocks available commercially.

The design and implementation of the cell, based on the cell
architecture derived in Chapter 3, are explained in Chapter 4.

Chapter 5 describes the instruction set and the software utilities
developed for programming the cell. This chapter concludes with an example of
matrix-matrix multiplication.

Finally, in Chapter 6 we conclude the thesis with a note on the current
implementation of the cell and suggestions for its enhancement.



2. SYSTEM ARCHITECTURE 


Systolic arrays are algorithmically specialized architectures.
Algorithms dictate the layout, the interconnection pattern among the cells and
structural features like the pipelining of data and partial results. Systolic
systems are used as attached processors to a general-purpose computer and
should therefore be able to adapt themselves to different algorithms. This
chapter lists the different types of systolic array structures, followed by the
SASP (Systolic Array Signal Processor) system architecture [19].

2.1 SYSTOLIC ARRAY STRUCTURES

Over the last few years many researchers have tried to map different
algorithms onto systolic arrays. Fortes et al. have compared many existing
methods for constructing a systolic array [11]. CAD software packages have also
been developed which can generate a systolic array structure for a given
algorithm. Such structures can be classified into five broad categories. They
are -

i) one-dimensional linearly connected array

ii) two-dimensional mesh connected array

iii) two-dimensional hexagonally connected array

iv) binary tree connected array

v) triangular array

Figure 2.1 shows various systolic structures and Table 2.1 lists some
of the computations which map well onto these arrays.



Array structure               Applications
---------------------------   ----------------------------------------------------
One-dimensional linear        FIR filter, 1D and 2D convolution, discrete Fourier
array                         transform, solution of triangular linear systems

Two-dimensional mesh          Graph algorithms, transitive closure, minimum
connected array               spanning tree, shortest path, dynamic programming,
                              2D convolution

Two-dimensional hexagonal     Matrix multiplication, LU-decomposition, Gaussian
array                         elimination, QR decomposition

Binary tree                   Searching algorithms, queries on nearest neighbor,
                              rank, etc., parallel function evaluation, recurrence
                              evaluation

Triangular array              Formal language recognition, QR decomposition

In Section 1.1 we referred to some of the computations encountered
in modern signal processing techniques. Most of these computations can be
reduced to two distinct operations, namely i) matrix-vector multiplication and
ii) matrix-matrix multiplication. Matrix-vector multiplication algorithms can be
implemented very easily on a one-dimensional linearly connected array.
Similarly, matrix-matrix multiplication can be mapped onto a two-dimensional
hexagonal array. We now consider the time complexity of these two arrays for
matrix-vector and matrix-matrix multiplication.

Consider a one-dimensional linear systolic array containing L PEs
(Figure 2.2a). We will use cut-and-pile mapping (see Appendix 1) to map a large
problem onto a fixed-size array. The objective is to compute the vector
Y = AX + B, where A is an N-by-M matrix, and X and B are vectors of dimensions M
and N respectively. The time taken, in clock cycles, to compute all the elements
of Y is T = 2NM/L + 2L - 3 [13].

Figure 2.2 Linear and hexagonal array with cut-and-pile mapping

Now consider the computation E = FG + H, to be performed on an L-by-L
hexagonally connected PE array (Figure 2.2b). F, G and H are matrices of size
M-by-N, N-by-P and M-by-P respectively. Once again using the cut-and-pile method
to map a large problem onto an array of fixed size, the total time, in clock
cycles, to compute all the elements of matrix E is 3MPN/L^2 + 3MP/L + L [13].

To compare these two structures, we define the "array utilization"
as U = T_1/(n T_n), where n is the number of PEs in the array, T_1 is the number
of clock cycles needed to solve the problem with only one PE and T_n is the
number of clock cycles needed to solve it on the systolic array with n PEs. The
ideal value of the utilization factor is unity. The utilization factor of the
one-dimensional array approaches 1/2 for large values of N and M. Similarly, the
utilization factor of the two-dimensional hexagonal array approaches 1/3 when
NMP >> L^3.
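
The limit quoted for the one-dimensional array can be checked numerically.
The sketch below uses the cycle count T = 2NM/L + 2L - 3 given above for the
linear array; the single-PE time T_1 = NM (one multiply-accumulate per clock
cycle) is an assumption made here for illustration, and the function names are
hypothetical.

```python
def linear_array_cycles(n_rows: int, m_cols: int, num_pes: int) -> float:
    """Clock cycles for Y = AX + B on a linear array of num_pes PEs,
    using the expression T = 2NM/L + 2L - 3 quoted in the text."""
    return 2 * n_rows * m_cols / num_pes + 2 * num_pes - 3


def utilization(t_single: float, num_pes: int, t_array: float) -> float:
    """Array utilization U = T_1 / (n * T_n)."""
    return t_single / (num_pes * t_array)


def linear_array_utilization(n_rows: int, m_cols: int, num_pes: int) -> float:
    # Assumption (not from the thesis): a single PE needs one cycle per
    # multiply-accumulate, so T_1 = N * M.
    t_single = n_rows * m_cols
    return utilization(t_single, num_pes,
                       linear_array_cycles(n_rows, m_cols, num_pes))


if __name__ == "__main__":
    for size in (8, 64, 1000):
        print(size, round(linear_array_utilization(size, size, num_pes=10), 4))
```

As N and M grow for a fixed number of PEs, the 2L - 3 overhead becomes
negligible and the computed utilization tends to 1/2.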

From the above discussion we can conclude that the one-dimensional linear
array is the more suitable for signal and image processing applications.
Moreover, from Figure 2.2 one can see that the interconnections between PEs in a
two-dimensional hexagonal array are more complicated than those in a
one-dimensional linear array. Based on these considerations we settle on the
one-dimensional linear array for the SASP system.

2.2 SASP SYSTEM ARCHITECTURE

The SASP architecture is similar to that of the Warp system developed at
Carnegie Mellon University [15]. The SASP machine will work as an attached
processor to a general-purpose host, a PC/XT in our case. The system consists of
four major parts: the host, the interface unit (IFU), the processor array and
the intercell communication channel (see Figure 2.3).

Figure 2.3 SASP system architecture


i) Host: The host supervises the operation of the array by supplying
data to the processor array and receiving the results back from the array. It is
also used to download the programs into the cells. Further, the host may also
perform those parts of the computation that cannot be efficiently mapped onto
the array. Any decision making during the execution of a problem is also done by
the host.














ii) Interface Unit (IFU): The systolic array requires the input data to
be fed to the array at a rate determined by the algorithm being executed.
Similarly, the rate at which the array outputs results is specific to the
problem. During the execution of large problems on a small, fixed-size array,
intermediate results are generated which have to be fed back into the processor
array until the final output is available. All this complex data handling cannot
be done by the host, because the communication bandwidth of the host (in
bytes/sec) cannot match the bandwidth requirement of the array. Further, it is
desirable that the host and the array operate independently.

This calls for a special unit (the IFU) which can match the bandwidths
of the host and the processor array. Apart from matching the bandwidths, the IFU
has the following functions - i) routing data according to the array
configuration, i.e., forward or backward, ii) receiving intermediate results and
looping them back into the array, iii) receiving and storing the final results,
and iv) providing the global clock and synchronizing signals to the processor
array.

iii) Processor Array: This is the computing unit of the whole system.
It consists of identical cells, each connected to its two adjacent cells, one on
either side. Each cell is a programmable, horizontally microcoded system with
its own program sequencer, address generator, floating-point multiplier and ALU,
and data memory. All the cells and the IFU are connected to a broadcast bus
(BC-bus), which is used only by the host to communicate with the IFU and the
cells. The main uses of this bus are to download the microprogram to the IFU and
the cells, to load initial data into the data memories, and to read data/results
from the data memory when a cell interrupts the host due to abnormal program
termination. Section 2.3 describes the BC-bus in more detail.



iv) Intercell communication channel: In SASP, intercell communication
takes place on two communication channels called the X and Y channels. These
channels are to be implemented using FIFOs. The details of these channels are as
follows.

X channel: This unidirectional channel originates at the IFU and ends
at the last cell, i.e., the right-most cell. Data which will be used for
multiple computations is transmitted over this channel. The X data passes
through each cell unmodified, with each cell using the data on the X channel to
perform some computation locally.

Y channel: This is a bi-directional channel whose direction is
controlled by the algorithm being executed. This channel forms a closed loop -
originating at the IFU, traveling through the cells and finally terminating at
the IFU. Partial and final results move on this channel. This channel helps in
implementing a large problem on an array with a limited number of cells.

The communication channels also have hardware flow control. When a cell
tries to read from an empty queue of a neighbouring cell, it is blocked (i.e.,
the cell does nothing) until data arrives. Similarly, when a cell tries to write
into a neighbouring cell's queue which is full, the writing cell is blocked
until at least one data item is removed from the queue. The blocking of a cell
is transparent to the user program. Except for the blocked cell(s), all other
cells in the array continue to operate normally.
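
The blocking behaviour described above can be modelled with a bounded
queue between two cells. The following is a minimal software sketch, not the
SASP hardware: the class and function names, the queue capacity and the
consumer's read period are all hypothetical parameters chosen for illustration.

```python
from collections import deque


class Channel:
    """Bounded FIFO between two neighbouring cells."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.queue = deque()

    def try_write(self, item) -> bool:
        """Return False (writer blocked this cycle) if the queue is full."""
        if len(self.queue) >= self.capacity:
            return False
        self.queue.append(item)
        return True

    def try_read(self):
        """Return (False, None) (reader blocked this cycle) if the queue is empty."""
        if not self.queue:
            return False, None
        return True, self.queue.popleft()


def run_two_cells(items, capacity: int, consumer_period: int):
    """The producer cell tries to write on every cycle; the consumer cell
    tries to read once every consumer_period cycles. A blocked cell simply
    does nothing in that cycle and retries later.

    Returns (received_items, producer_stall_cycles, consumer_stall_cycles).
    """
    channel = Channel(capacity)
    pending = deque(items)
    received = []
    producer_stalls = consumer_stalls = 0
    cycle = 0
    while len(received) < len(items):
        if cycle % consumer_period == 0:
            ok, value = channel.try_read()
            if ok:
                received.append(value)
            else:
                consumer_stalls += 1      # queue empty: reader is blocked
        if pending:
            if channel.try_write(pending[0]):
                pending.popleft()
            else:
                producer_stalls += 1      # queue full: writer is blocked
        cycle += 1
    return received, producer_stalls, consumer_stalls


if __name__ == "__main__":
    print(run_two_cells([1, 2, 3, 4], capacity=1, consumer_period=2))
```

The data always arrives complete and in order; a mismatch in speed between
the two cells shows up only as stall cycles, which is why the blocking can stay
invisible to the user program.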

As each cell is to have its own address generator, it was felt that a
separate address queue would not be necessary. The address generator should be
capable of generating linear addresses, circular-buffer addresses and
bit-reversed addresses, all of which are frequently encountered in signal
processing techniques.
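
The three addressing modes can be sketched in software as follows. The
function names and parameters are hypothetical and only illustrate the address
sequences themselves; they do not describe the address generator hardware used
in the cell.

```python
def linear_addresses(base: int, stride: int, count: int) -> list:
    """base, base + stride, base + 2*stride, ..."""
    return [base + i * stride for i in range(count)]


def circular_addresses(base: int, length: int, start: int, count: int) -> list:
    """Step through a buffer of `length` words, wrapping at the end
    (typical for the delay line of an FIR filter)."""
    return [base + (start + i) % length for i in range(count)]


def bit_reversed_addresses(base: int, num_bits: int) -> list:
    """Visit a buffer of 2**num_bits words in bit-reversed index order
    (the data ordering needed by radix-2 FFT algorithms)."""
    addresses = []
    for index in range(1 << num_bits):
        reversed_index = 0
        for bit in range(num_bits):
            if index & (1 << bit):
                reversed_index |= 1 << (num_bits - 1 - bit)
        addresses.append(base + reversed_index)
    return addresses


if __name__ == "__main__":
    print(linear_addresses(0x100, 2, 4))
    print(circular_addresses(0, 4, 2, 6))
    print(bit_reversed_addresses(0, 3))
```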

2.3 BROADCAST BUS

As stated earlier, the broadcast bus is used only by the host to write
into, or read from, the memories of the cells. The signals on this bus are as
follows.

1) Data bus: This bus is used by the host to transfer data to/from
the cells.

2) RESET*: This is a software reset signal used to reset the sequencer
in the cell. This signal is also used to synchronize the execution of all the
cells.

3) WCS (Write Control Store): Assertion of this signal forces the
sequencer (ADSP-1401) in a cell to execute a WCS instruction. After executing
this instruction the sequencer addresses the microcode memory sequentially,
starting from 0000, so that the host can program the cell.

4) Cell address: This four-bit address selects the cell or the IFU with
which the host wants to communicate. Of the sixteen addresses, one is reserved
for broadcast.

The address assignments are as follows -

0          interface unit (cell 0)

1 to 13    one of the fourteen SASP cells

14         broadcast address, i.e., all the cells are selected

15         no cell is selected

5) Clk: This is the global clock supplied by the IFU; it is used by all
the cells in the system to time their activities.

6) HWR*: This signal is supplied by the host and is used to latch data
from the host into the microcode memory and the data memory.

7) HRD*: This signal is supplied by the host and is used to read data
from the microcode memory and the data memory.
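
The cell address assignments listed under signal 4 can be expressed as a
small decoder. This is an illustrative sketch only: the function names are
hypothetical, and the treatment of the broadcast address as selecting the cells
but not the IFU is an assumption based on the wording "all the cells are
selected".

```python
IFU_ADDRESS = 0
FIRST_CELL_ADDRESS = 1
LAST_CELL_ADDRESS = 13
BROADCAST_ADDRESS = 14
NO_CELL_ADDRESS = 15


def decode_cell_address(address: int):
    """Classify a four-bit broadcast-bus address.

    Returns a (kind, cell_number) pair, where cell_number is None unless
    a single cell is addressed.
    """
    if not 0 <= address <= 15:
        raise ValueError("cell address must fit in four bits")
    if address == IFU_ADDRESS:
        return ("ifu", None)
    if address <= LAST_CELL_ADDRESS:
        return ("cell", address)
    if address == BROADCAST_ADDRESS:
        return ("broadcast", None)
    return ("none", None)


def cell_is_selected(bus_address: int, cell_number: int) -> bool:
    """True if the cell with the given number (1 to 13) should respond.

    Assumption: the broadcast address selects every cell.
    """
    if not FIRST_CELL_ADDRESS <= cell_number <= LAST_CELL_ADDRESS:
        raise ValueError("cell number must be between 1 and 13")
    kind, target = decode_cell_address(bus_address)
    return kind == "broadcast" or (kind == "cell" and target == cell_number)


if __name__ == "__main__":
    for addr in (0, 5, 14, 15):
        print(addr, decode_cell_address(addr), cell_is_selected(addr, 5))
```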



2.4 PROTOTYPE SASP


Work on the SASP was initiated in 1988 by Nemawarkar [19], who defined
the system architecture for SASP. Based on this design, a prototype of the IFU
for SASP was designed by Samit [21] and Usman [17]. The salient features of this
unit are - i) it is microprogrammable, and it has ii) an X memory which holds
the data to be passed on to the processor array, iii) a YA memory which holds
the initial values and the final results received from the processor array, and
iv) a YB memory which holds the intermediate results output by the processor
array. This unit has an 8-bit data path throughout.

A small cell with limited capability was also developed by them to test
the working of their IFU. The cell was implemented using an 8-bit
multiply-and-accumulate chip (ADSP-1008A). This cell had a local memory of 8192
bytes to store the local data. The X and Y communication channels in this cell
were implemented using simple latches. The IFU was able to transfer data between
its X, YA and YB memories and the latches in the cell. The IFU along with the
cell was able to run programs like matrix-matrix multiplication, linear
convolution and AR filtering. Further, Usman has also discussed a design for a
32-bit cell with enhanced capabilities.

In parallel with the design of a 32-bit cell, carried out as a part of this
thesis, a 32-bit IFU has been designed by Shera [23] to complement this new
cell.

Summarizing, the linear array structure of SASP simplifies intercell
communication without reducing the communication bandwidth between adjacent
cells. Because the cells are modular, new cells can be added. For an array
with a large number of cells, global synchronization can be a problem. Keeping
this in view, run-time flow control was incorporated in the architecture so
that a switchover to a wavefront architecture is feasible.



3. CELL ARCHITECTURE 


The SASP cell should be able to perform the computationally intensive portion
of signal processing efficiently. This requirement justifies the inclusion of
high-performance hardware in the design of the cell. The considerations to be
taken into account before deciding the cell architecture are taken up in the
next section. This is followed by a brief survey of the commercially available
DSP building blocks. The final configuration of the cell is developed in the
last section.

3.1 ARCHITECTURAL CONSIDERATIONS

Computations encountered in signal processing are regular and repetitive. For
these applications the cells in SASP can be used either in pipeline or in
parallel mode. The pipeline mode of computation requires a high intercell
communication bandwidth. The parallel mode of operation requires the cell to
be more powerful and capable of operating independently. For the cell to
operate in either of these two modes, the following considerations were taken
into account.

3.1.1 DSP System Alternatives

Presently available hardware offers two alternatives for the design of DSP
systems: i) DSP microprocessors and ii) microprogrammable basic building
blocks.

A DSP microprocessor contains all the computational elements on a single chip
and has separate bus structures for program and data memories (the so-called
Harvard architecture). The resulting improvement in memory bandwidth (i.e.,
the number of bytes transferred per second) is still not sufficient for
real-time or high-speed signal processing. Operations like filtering,
convolution, etc., require multiple operands for each computation, and the
instruction sets of the currently available DSP microprocessors cannot address
multiple operands simultaneously because of their limited number of data paths
to and from the external world.¹ Further, these microprocessors do not allow
the user to exploit all the inherent parallelism available in the hardware.
For example, a STORE to memory could logically be accomplished at the same
time as a JUMP to a branch routine. This restriction is mainly due to the
vertical microprogramming used in the control unit of these microprocessors.
In vertical microprogramming the instruction bits are fully utilized, with
each instruction specifying a single discrete operation.

A microcode-based system design using basic DSP building blocks offers a high
degree of functional parallelism and hence a higher throughput. The main
features of such a system are: i) multiple functional units, ii) multiple
interconnections and iii) multi-operation instructions. The system can be
tailored to meet the specifications by including multiple functional units and
data paths that are not available otherwise. Further, microprogramming gives
direct access to the internal parallelism of the hardware. The designer can
either employ completely horizontal microprogramming, where the instruction
bits are not fully utilized and several operations can be initiated
simultaneously, or use a combination of horizontal and vertical
microprogramming, depending upon the system architecture.

For a high degree of functional parallelism the cell should include the
following functional blocks:

* ALU capable of addition, subtraction and logic operations
* Multiplier
* Data memories
* One-to-one connection between different blocks, i.e., a crossbar switch
* Address generators for the data memories
* Program sequencer

¹ This is true only for those DSP microprocessors which employ a pseudo-Harvard
architecture, for example the TMS32010.

3.1.2 Cell Optimization For Signal Processing Applications

As mentioned in Section 2.1, most signal processing algorithms involve
matrix-vector or matrix-matrix multiplications. The main arithmetic operation
encountered in signal processing is the inner product, or the
Multiply-and-ACcumulate (MAC) operation [1].

This operation is so common that we could as well provide a MAC unit, which
does not require an intervening store and load of the product to and from a
high-speed register. However, since our system is not meant to be fine-tuned
for a particular operation, it is better to have a separate multiplier and an
ALU, with a provision to load the multiplier output directly into the ALU.
Moreover, providing just a MAC would reduce the functional parallelism, as we
could not initiate a separate multiplication and addition simultaneously.

To integrate the SASP properly in a variety of host environments, the number
representation used by the cell and by the host should match exactly. Further,
it is desirable that the number system has a wide dynamic range. Both these
features are available in the IEEE 754 standard for 32-bit single-precision
and 64-bit double-precision floating-point numbers [9].

Division is not encountered frequently in signal processing applications.
Therefore, instead of providing a separate unit, division can be programmed in
software using a small look-up table and a Taylor series expansion involving
multiplications and additions only [4].
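One way such a table-plus-series division might look is sketched below. This is an illustrative scheme, not the method of [4]: the table size (`TABLE_BITS`), the number of series terms and all function names are assumptions. The mantissa is reduced to [1, 2), a table supplies the reciprocal 1/x0 of a nearby point x0, and the series 1/(1+d) = 1 - d + d^2 - ... corrects for the remainder using only multiplications and additions.

```python
import math

TABLE_BITS = 6  # assumed table size: 64 entries
# Table contents (reciprocals of interval midpoints in [1, 2)) would be
# precomputed and held in read-only memory.
RECIP_TABLE = [1.0 / (1.0 + (k + 0.5) / 2**TABLE_BITS)
               for k in range(2**TABLE_BITS)]


def reciprocal(x: float, terms: int = 3) -> float:
    """Approximate 1/x with a table look-up and a truncated Taylor series."""
    if x == 0.0:
        raise ZeroDivisionError("reciprocal of zero")
    m, e = math.frexp(abs(x))      # abs(x) = m * 2**e with m in [0.5, 1)
    m, e = m * 2.0, e - 1          # rescale so that m lies in [1, 2)
    k = int((m - 1.0) * 2**TABLE_BITS)
    x0 = 1.0 + (k + 0.5) / 2**TABLE_BITS
    r0 = RECIP_TABLE[k]            # table value of 1/x0
    d = (m - x0) * r0              # m = x0 * (1 + d)
    series = 1.0
    for _ in range(terms):         # Horner form of 1 - d + d^2 - d^3 ...
        series = 1.0 - d * series
    return math.copysign(math.ldexp(r0 * series, -e), x)


def divide(a: float, b: float) -> float:
    """Division expressed as a multiplication by the reciprocal."""
    return a * reciprocal(b)
```

With a 64-entry table and three series terms, the relative error of this sketch is of the order of 1e-9; a larger table or more terms trades memory or cycles for accuracy.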

3.1.3 Vector And Scalar Processing Capabilities

Supercomputers like the Cray-1, VP-100, etc., provide one set of hardware to
perform vector operations and another set to perform scalar operations.
A significant saving in hardware can be achieved if the same hardware can be
used for scalar as well as vector processing. Vector processing is repetitive
in nature, and the data is stored in consecutive memory locations to
facilitate addressing. On the other hand, the occurrence of scalar operations
is usually more random and the data arrangement in memory may not be
sequential. To strike a balance between vector and scalar computations without
degrading either of them, it is necessary to use (a) functional units (ALU and
multiplier) with a minimum number of pipeline stages (two or three stages of
pipelining give an optimal balance between scalar and vector computation [1]),
and (b) a separate address generator capable of fast memory access, with
little programming overhead, for a variety of access modes such as sequential,
circular buffer, bit-reversed and non-sequential.

3.1.4 Data Memories

To initiate a triadic operation like a = b*c + d every clock cycle, we must be
able to supply three operands to, and receive one result from, the functional
units. As only one operand resides in the cell and the other two operands will
be received from the neighboring cells, there will be memory contention when a
cell tries to access the memory of its adjacent cell for a data transaction.

A simple solution to the above problem would be to provide two dual-ported
memories in each cell, one for cell_{i-1} and another for cell_{i+1}. There
are two main drawbacks of this approach. Firstly, there is no way of detecting
that cell_{i-1} or cell_{i+1} has overwritten a memory location before cell_i
could use the previous data. Secondly, address lines would have to be routed
from cell_{i+1} and cell_{i-1} to cell_i.

Noting that the access to the data received by cell_i from its neighbor is
always going to be sequential [20,21], the above drawbacks can be avoided by
replacing the dual-ported memory with a FIFO. The use of a FIFO obviates the
need for the address bus and also blocks any attempt to overwrite a data item
before cell_i can read it [16].
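The behavior that makes a FIFO attractive here can be sketched in a few lines. The class below is a hypothetical software model, not a description of any particular FIFO chip: a write is refused while the queue is full, so unread data is never overwritten, and no address is supplied by either side.

```python
from collections import deque


class CellQueue:
    """Toy model of the FIFO between two neighboring cells."""

    def __init__(self, depth: int = 1024):
        self.depth = depth
        self._words = deque()

    @property
    def full(self) -> bool:
        return len(self._words) >= self.depth

    @property
    def empty(self) -> bool:
        return not self._words

    def write(self, word) -> bool:
        """Store a word; return False (writer must wait) if the queue is full."""
        if self.full:
            return False
        self._words.append(word)
        return True

    def read(self):
        """Return the oldest unread word; the reader must wait if empty."""
        if self.empty:
            raise LookupError("queue is empty")
        return self._words.popleft()
```

The writing cell only sees the full flag and the reading cell only sees the empty flag, which is the run-time flow control between neighbors in miniature.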

3.1.5 Functional Unit Interconnections

Keeping functional units like the ALU and the multiplier occupied in every
clock cycle requires a high-bandwidth interconnection between the memory and
the functional units. Interconnection schemes range from a single bus, where
only one item can be transferred at a time, to a fully connected crossbar
switch, where all possible connections can be made simultaneously. The
crossbar connection has the highest efficiency and offers the highest
bandwidth, and it was therefore decided to use a full crossbar to interconnect
the memory and the FIFOs to the ALU and the multiplier. The crossbar can be
implemented using PALs or multiplexers and demultiplexers.

3.1.6 Register File

A register file increases the communication bandwidth by using multiport
memory. Inclusion of a register file facilitates the following:

i) Flexible data flow between the multiplier and the ALU.

ii) Feedback of the ALU output to its input with some delay, which is useful
in the MAC operation.

iii) As the data memory can be used either in read or in write mode in a
single clock cycle, the register file can hold data which will be written to
the data memory at a later clock cycle.

iv) Gather² and scatter operations for sparse matrices without using expensive
hardware like the bit vector of the Cyber-205 [14].


The cell architecture shown in Figure 3.1 is based on the above-mentioned
considerations and on the system architecture described in Chapter 2.



Figure 3.1 Cell architecture


² A sparse matrix contains few non-zero elements. To reduce the computation
time for such matrices, the non-zero elements are gathered and a new array is
formed. This array can be stored in the register file and pipelined through
the functional unit at maximum speed. The results of these computations can be
stored back in the register file. Finally, these results are scattered into
the data memory at the appropriate locations of the original array.
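The gather and scatter steps described in this note can be illustrated as follows. This is a plain software sketch with hypothetical function names; it says nothing about how the register file would sequence the transfers in hardware.

```python
def gather(vector):
    """Collect the non-zero elements and remember where they came from."""
    indices = [i for i, value in enumerate(vector) if value != 0]
    values = [vector[i] for i in indices]
    return indices, values


def scatter(indices, values, length):
    """Write the results back to their original positions."""
    result = [0] * length
    for i, value in zip(indices, values):
        result[i] = value
    return result


# Only the gathered (dense) array passes through the functional unit.
sparse = [0, 3, 0, 0, 5]
positions, dense = gather(sparse)
doubled = scatter(positions, [2 * v for v in dense], len(sparse))
```

Here only two of the five elements are actually processed, which is where the saving in computation time comes from.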
3.2 DSP BUILDING BLOCKS

Advances in VLSI have made it possible to squeeze more and more hardware into
a single chip. This has led to the design of various efficient functional
units that can be used for real-time digital signal processing. In this
section we consider some of these functional units.

3.2.1 Program Sequencer

A program sequencer's main task is to sequence the microprogram by providing
the appropriate address for the next instruction in every clock cycle. The
sequencer also has several other functions, as listed below:

i) Handle normal program flow, incrementing the program counter by one in each
cycle.

ii) Keep track of subroutine addressing and manage the return address stack.

iii) Manage loops with little overhead, using on-chip loop counters.

iv) Service interrupts from external devices.

v) Branch to an instruction, either conditionally or unconditionally, using a
direct address or an address offset read through an external port. (Most
sequencers provide two-way and three-way jumps within a single instruction,
which can speed execution and ease programming by eliminating the need to code
explicit loop testing.)

Commercially available program sequencers include the ADSP-1401, ADSP-1402,
Am2910, IDT39C410, IDT49C410 and IDT49C411. The IDT39C410 is pin compatible
with the Am2910 but consumes less power. Both of them provide a 12-bit address
for the microprogram memory. A 33-deep LIFO stack provides microprogram
subroutine linkage and looping capability. A 12-bit counter is also provided
for loop iteration and for repeating instructions. The IDT49C410 is also
functionally equivalent to the Am2910 but has a 16-bit wide microprogram
memory address bus and a 16-bit loop counter. The Am2910 family lacks a
relative jump instruction and has no provision for hardware interrupts. The
IDT49C411 has a 20-bit wide address bus, a 20-bit loop counter and three
independent 64-deep stacks, and can handle eight prioritized interrupts. The
ADSP-1401 is much more versatile than the Am2910 family. It has a 16-bit
address bus, four independent loop or event counters, ten prioritized
interrupts and a 64-word stack with three stack pointers. It can also execute
relative jumps and subroutine calls. The ADSP-1402 is functionally equivalent
to the ADSP-1401 and has some additional glue logic built into it, which
otherwise would have to be provided by the designer.

3.2.2 Address Generator

Fast and flexible addressing of data is important in digital signal processing
applications. Data array scanning is one of the most common address
manipulations in DSP. Data array scanning can be sequential or decimated,
i.e., every second, fourth, etc., location may have to be addressed. Further,
most DSP look-up tables are circular buffers. The address generator should be
able to reset the pointer to the start address of the buffer when the pointer
equals some preset boundary value. For highly structured algorithms such as
the FFT, the data access may be decimated in time or in frequency, and the
address needs to be bit reversed. These requirements call for a specialized
address generator capable of performing the above-mentioned operations.
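The three access patterns named above can be written down compactly. The sketch below is a software illustration with hypothetical function names; in particular, the circular wrap-around is modelled with a modulo operation as a simplification of the "reset at a preset boundary" behavior, and it does not describe the internals of any specific address generator.

```python
def bit_reverse(index: int, bits: int) -> int:
    """Reverse the lowest `bits` bits of index (FFT input/output ordering)."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def circular_addresses(start: int, length: int, step: int, count: int):
    """Generate `count` addresses that wrap inside [start, start + length).

    step = 1 gives sequential scanning; step = 2, 4, ... gives decimated
    scanning.
    """
    addresses = []
    offset = 0
    for _ in range(count):
        addresses.append(start + offset)
        offset = (offset + step) % length   # wrap back into the buffer
    return addresses
```

For an 8-point FFT, `[bit_reverse(i, 3) for i in range(8)]` yields the order 0, 4, 2, 6, 1, 5, 3, 7.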

Of all the commercially available address generators, the ADSP-1410 is a
general-purpose address generator which has all the above features. The Am2901
is a 4-bit slice which needs to be cascaded to get an adequate addressing
range. This chip lacks the circular buffer and bit-reversed addressing
features. The IDT49C402 is a superset of the Am2901 and is functionally
equivalent to four Am2901s and one Am2902. The Am29540 is a special-purpose
address generator which can generate both data addresses and twiddle-factor
addresses for either radix-2 or radix-4 butterflies, with bit-reversed input
or output, and with decimation either in time or in frequency.

3.2.3 FIFOs

The first-generation FIFOs were of the "register array" architecture and used
a "bucket brigade" arrangement to move data out of the FIFO [16]. The
second-generation FIFOs use dual-ported static RAM in order to achieve truly
independent, asynchronous operation of the input and the output. The RAM in
the FIFO is addressed internally using two counters, one for read and the
other for write. In addition, a bit is used for every FIFO word to designate
which words have been written but not yet read. The width and length of these
FIFOs can be increased by juxtaposing and cascading them with a little extra
logic. FIFOs are available in various sizes from different manufacturers. Some
of them are listed below:

CY7C412/IDT7201A            -  512 x 9 bits
CY7C424/IDT7202A/IDT72021   - 1024 x 9 bits
CY7C429/IDT7203             - 2048 x 9 bits
IDT7204/72041               - 4096 x 9 bits

3.2.4 Floating-Point ALU and Multipliers

There is a heavy dependence on floating-point arithmetic in signal processing
applications because of the potentially large dynamic range of values.
A discrete implementation of a 32-bit floating-point processor can take at
least one board of SSI and MSI circuits. It is therefore preferable to use an
IC or an IC set that implements the IEEE standard. Further, as DSP
applications require high-speed operation, the multiplier should be an array
multiplier and the multiplication should not span more than two or three
pipeline stages. Tables 3.1 and 3.2 contain a list of available floating-point
chip sets along with their salient features. Some of these floating-point chip
sets provide two modes of operation: (i) pipelined and (ii) "flow-through"
(i.e., the pipeline registers are made transparent). With pipelining, new
operands are loaded every clock cycle and one result is available every clock
cycle once the pipe is full. In the flow-through mode, no new operands are
presented to the inputs for the duration of the operation.

3.2.5 Register File

Register files are multiport memories that provide high-speed local storage
for data and flexibility in data transfer from one port to another. Register
files can also be considered as a data cache, for they are generally used to
hold data that are used frequently or will be used in the near future, i.e.,
spatial locality. A register file provides temporary data storage and
simultaneous input and output of several pieces of data, thereby expanding the
computational bandwidth of the system. As said earlier, a register file is a
static RAM surrounded by the latches and control logic needed for a simple
system interface. Each port has its own address latch and read/write signals
and is permanently configured as a read-only port or as a write-only port.
However, it is possible for one or two ports to be configured as read and
write ports. Accesses to the memory from all ports are time multiplexed, and
different ports are assigned different priorities. Because of the plethora of
data paths, many combinations of reads and writes are possible. Commercially
available register files include the ADSP-3128 and the WTL 1066.

Table 3.1 32-bit floating-point multipliers

Table 3.2 32-bit floating-point ALU

SP (single-precision floating-point), DP (double-precision floating-point),
TC (twos-complement fixed-point), US (unsigned fixed-point),
MM (mixed mode integer), Y (yes)

3.3 FINALIZATION OF CELL ARCHITECTURE

From the foregoing survey of DSP building blocks we can see that Analog
Devices provides a complete set of devices meant for high-speed digital signal
processing applications. We have selected the following components for
implementing the cell:

ADSP-1401 - program sequencer
ADSP-1410 - address generator
ADSP-3210 - floating-point multiplier
ADSP-3220 - floating-point ALU
ADSP-3128 - five-port register file
IDT7202A  - 1024 x 9 bit FIFO

The salient features of these devices are given in Appendix 4.

Although the configuration given in Figure 3.1 is capable of the basic
operations needed by a cell, some additional hardware, such as an auxiliary
memory, and an appropriate choice of interconnection would make the cell
operation even more flexible.

3.3.1 Auxiliary Memory

As stated earlier, it might be necessary to assign each cell the job of more
than one cell when the problem size is greater than the number of cells in the
array. Under such circumstances the cell might have to hold several partial
results within itself. If we use the data memory to hold partial results,
there will be heavy traffic between the data memory and the functional units.
Further, as the data memory can be used either for a read or for a write
operation during a single clock cycle, the throughput will be affected. To
hold such data with high temporal locality, it was decided to add a small
memory connected directly to the ALU and the multiplier, without increasing
the size of the crossbar. The auxiliary memory will also have a read-only
portion for frequently used constants, such as the sine and cosine tables for
the FFT and the inverse table for the division operation.

3.3.2 Simulating Multiple Cells

Two ways of mapping large problems onto a fixed-size array are given in
Appendix 1. Both these methods require some extra hardware to facilitate the
mapping and to preserve the properties of the systolic array mentioned in
Section 1.3.

Cut-and-pile mapping requires a link between the last cell and the first cell,
thereby forming a closed path. This closed path is easily provided by
including the IFU in the loop. The IFU will receive data from the last cell
and pass it to the first cell.

Coalescent mapping requires a loop back from the cell's output to its input.
This loop is required on the X and Y data links and can easily be implemented
by using multiplexers at the X and Y queue inputs.


3.3.3 Register File Connection

The register file (ADSP-3128) that we have chosen has five ports. Ports A and
B are write-only ports, ports C and D are read-only ports, and port E is a
bi-directional port. Each port is 16 bits wide. The ALU, the multiplier and
the register file can be connected together in a number of different
configurations. After evaluating different configurations we have selected the
configuration shown in Figure 3.2, for it offers the most flexible
interconnection between the ALU, auxiliary memory, crossbar, multiplier and
register file.

Figure 3.2 Register file, ALU, multiplier and auxiliary memory interconnection

3.4 FINAL CELL ARCHITECTURE

Augmenting the cell architecture derived in Section 3.1 with the refinements
suggested in Section 3.3, the final cell structure is as shown in Figure 3.3.
The cell uses 32-bit wide data paths throughout, without any tradeoff between
complexity and efficiency.


Figure 3.3 Final cell architecture










4. CELL DESIGN AND IMPLEMENTATION 


The SASP cell implementation is spread over three different modules. The first
module, called the microengine, contains the program sequencer and the
microprogram memory. The second module, called the data cache, includes the
register file and the auxiliary memory. The third module, called the
number-crunching unit, houses the crossbar circuitry, the data memory, the X
and Y queues, the floating-point ALU and the multiplier. At present only the
first and the third modules have been implemented.

4.1 MICROENGINE

A block diagram of the microengine is shown in Figure 4.1. The microengine
consists of a program sequencer (ADSP-1401), an address generator (ADSP-1410),
host interface circuitry, two control registers, host read/write synchronizing
circuitry and a 128-bit wide microprogram memory. This module is assembled on
three PCBs.

4.1.1 Host Interface

The cell is mapped onto the I/O address space of the host (PC/XT). A prototype
adapter card plugged into the PC/XT decodes the address lines A15 to A5
(Figure 4.2). The decoded address generates the signal EN*, which acts as the
cell select signal. (In the case of multiple cells this signal will be
generated by decoding the cell address lines on the BC-bus.) The EN* signal is
ORed with CRN* (i.e., IOR* ANDed with IOW*) to generate a new signal BEN*,
which is used to enable a transceiver (74LS245) that connects the PC data bus
to the cell memories.

Figure 4.1 Microengine block diagram

A 3-to-8 line decoder (74LS138), located on the microengine card, decodes the
address lines A4 to A2 to generate chip-select signals for the different
registers and transceivers. This decoder is enabled by the EN* signal. The
different chip-select signals and their usage are listed below.


Signal   Address (in hex)   Usage

CS0*     FFE0 to FFE3       Microcode memory read/write
CS1*     FFE4 to FFE7       Data memory read/write
CS2*     FFE8 to FFEB       not used
CS3*     FFEC to FFEF       not used
CS4*     FFF0 to FFF3       not used
CS5*     FFF4 to FFF7       Cell control register (write only)
CS6*     FFF8 to FFFB       Microprogram memory bank select register (write only)
CS7*     FFFC to FFFF       Cell status register (read only)

Reading from or writing to the cell memories by the host requires
synchronization between the host and the sequencer. This in turn requires that
wait states be introduced whenever the host accesses the cell memories. The
prototype card contains a buffer (74LS244) which is enabled to pass the CRDY
signal from the microengine card to the PC bus. This buffer is enabled by the
RDYEN* signal, which is EN* ORed with the A4 address bit, i.e., wait states
will be introduced for CS0*, CS1*, CS2* and CS3*. For accesses to the other
registers this buffer is not enabled and its output is pulled high through a
pull-up resistor.
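The decoding just described can be summarized in a short sketch. It assumes, as the address ranges in the table imply, that EN* is asserted when A15 to A5 are all ones; the function name and the returned tuple are hypothetical conveniences, not part of the design.

```python
def decode_port(address: int):
    """Return (chip_select_number, wait_states_inserted) for an I/O address.

    Returns None when the cell is not selected.
    """
    # Assumption: EN* is active when A15..A5 are all ones (base FFE0 hex).
    if (address & 0xFFE0) != 0xFFE0:
        return None
    chip_select = (address >> 2) & 0x7   # A4..A2 feed the 3-to-8 decoder
    a4 = (address >> 4) & 0x1
    # RDYEN* = EN* OR A4 is low only when A4 = 0, i.e. for CS0* to CS3*.
    wait_states = (a4 == 0)
    return chip_select, wait_states
```

For instance, an access to FFE4 selects CS1* with wait states, while an access to FFF4 selects CS5* without them.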

4.1.2 Control Registers

There are two control registers in the cell (see Figure 4.3): i) the cell
control register and ii) the microprogram memory bank select register. The
cell control register is used by the host to put the sequencer in a known
state, and the bank select register is used to select different banks of the
microprogram memory while downloading the microprogram. The signals associated
with these registers and their descriptions follow.


i) Cell control register

Bit 0   WCS     Setting this bit to logic one causes the program sequencer
                instruction multiplexer to present a WCS instruction to the
                sequencer. This bit should be set only when the host wants
                to download a microprogram.
Bit 1   ENPL*   When this bit is set to logic zero, the pipeline register
                output is enabled; otherwise it is tri-stated.
Bit 2   CRES*   When this bit is set to logic zero, the sequencer is held in
                the reset state. When it is set to logic one, the sequencer
                starts executing from location 0000.
Bit 3   HFLAG   This bit can be read by the program sequencer at its FLAG
                input and can be set/reset by the host to change the program
                flow.
Bit 4   IRQ5    Hardware interrupt from host to the sequencer
Bit 5   IRQ6    Hardware interrupt from host to the sequencer
Bit 6   IRQ7    Hardware interrupt from host to the sequencer
Bit 7   IRQ8    Hardware interrupt from host to the sequencer



ii) Microprogram memory bank select register

Bit 0   BANK0*  Select microprogram memory bank 0
Bit 1   BANK1*  Select microprogram memory bank 1
Bit 2   BANK2*  Select microprogram memory bank 2
Bit 3   BANK3*  Select microprogram memory bank 3
Bit 4           Not used
Bit 5           Not used
Bit 6           Not used
Bit 7           Not used
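As an illustration of how a host program might compose the bytes written to these two registers, consider the sketch below. The function names are hypothetical. Two points are assumptions rather than facts from the text: the BANKx* bits are treated as active low (suggested only by the asterisk), with the unused bits written as zero, and the interrupt bits are passed through unchanged because their polarity is not stated.

```python
WCS_BIT, ENPL_N_BIT, CRES_N_BIT, HFLAG_BIT = 0, 1, 2, 3
IRQ_SHIFT = 4  # bits 4 to 7 carry the four host interrupt lines


def cell_control_byte(wcs: bool, pipeline_enabled: bool, in_reset: bool,
                      hflag: bool = False, irq_bits: int = 0) -> int:
    """Compose the cell control register value from its bit fields."""
    value = 0
    if wcs:
        value |= 1 << WCS_BIT
    if not pipeline_enabled:       # ENPL* is active low
        value |= 1 << ENPL_N_BIT
    if not in_reset:               # CRES* is active low
        value |= 1 << CRES_N_BIT
    if hflag:
        value |= 1 << HFLAG_BIT
    # Assumption: interrupt polarity is not given, so the raw bits are used.
    value |= (irq_bits & 0xF) << IRQ_SHIFT
    return value


def bank_select_byte(bank: int) -> int:
    """Select one microprogram memory bank (assumed active-low BANKx*)."""
    if bank not in range(4):
        raise ValueError("bank must be 0 to 3")
    return 0x0F & ~(1 << bank)
```

For example, `cell_control_byte(wcs=False, pipeline_enabled=True, in_reset=False)` gives 0x04, the value that releases the sequencer from reset with the pipeline register output enabled.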


4.1.3 Program Sequencer and Address Generator

The microengine uses the ADSP-1401 for sequencing the microprogram and the
ADSP-1410 address generator for data memory address generation. The
instruction for the program sequencer comes through the instruction
multiplexers (74LS257) (Figures 4.4 and 4.5). The multiplexers provide the WCS
instruction when a microprogram is to be downloaded (also see Section 4.1.5).
During normal program execution the instruction multiplexers pass the control
bits 0 to 6 from the microprogram memory to the program sequencer. The address
generator gets its instructions directly from the control bits 24 to 34.


The data ports of the program sequencer and the address generator are
connected to the data field of the microprogram memory through multiplexers.
These multiplexers provide the start address (0000) during microprogram
downloading and pass the data field (control bits 8 to 23) when the sequencer
is executing a microprogram. The data field is 16 bits wide and permits
loading of all the internal registers of the program sequencer and the address
generator with any desired value.

The data ports of the program sequencer and the address generator act as
output ports during the clock high period and as input ports during the clock
low period. Since the data ports are driven only when the clock is high, there
must be a provision to hold this data for the remaining clock period if we
want to transfer data between the address generator and the program sequencer.
This is done by providing a latch (74LS374) and a control signal CDXFER
(control bit no. 35), so that the data is latched during the clock high period
and the latch output is enabled during the clock low period.

The eight interrupt lines are time multiplexed onto the four interrupt pins of
the program sequencer. This time multiplexing is done using a quad 2-to-1 line
multiplexer (74LS257). Of the eight interrupt lines, the four higher-priority
lines are assigned to the host, while the four lower-priority lines can be
used by the cell's functional units.

4.1.4 Sequencer Clock Frequency

To calculate the operating frequency of the microengine one has to look at the
sequence of events which occurs during a microprogram memory instruction
fetch. As shown in Figure 4.6a, it involves setting up the program sequencer's
instruction, waiting for a valid address from the sequencer, accessing the
microprogram memory location, setting up the currently accessed instruction,
and so on.

The worst-case instruction set-up time and output delay time for the ADSP-1401
are 35 nsec each. The memory (HM6264LP-15) used for microprogram storage has a
maximum access time of 150 nsec. Therefore the smallest possible cycle time
will be t_cyc = (35 + 150 + 35) = 220 nsec. This gives a maximum operating
frequency of about 4.5 MHz.

Unfortunately this cycle time is not sufficient for data transfer from the
data memory to the ALU/multiplier registers, which load the data on the
falling edge of the clock. The sequence of events for this is shown in Figure
4.6b. The address generator's clock-to-output delay has a typical value of 40
nsec. The data memory has a maximum access time of 150 nsec. The data set-up
time for the ALU/multiplier has a minimum value of 20 nsec. The crossbar
introduces an additional delay of 20 nsec. Taking all this into account we get
t_cyc/2 = (40 + 150 + 20 + 20) = 230 nsec. Therefore the cycle time turns out
to be 460 nsec. In the present design we are using a 4.192 MHz crystal, whose
output is divided by two before being fed to the sequencer. This frequency
gives a cycle time of 476 nsec.
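The timing budget above can be re-derived with a few lines of arithmetic. The sketch below only restates the delays quoted in the text; the variable and function names are illustrative.

```python
def mhz_from_period(period_ns: float) -> float:
    """Convert a clock period in nanoseconds to a frequency in MHz."""
    return 1e3 / period_ns


def period_from_mhz(freq_mhz: float) -> float:
    """Convert a frequency in MHz to a clock period in nanoseconds."""
    return 1e3 / freq_mhz


# Instruction fetch path: set-up + memory access + output delay.
fetch_cycle_ns = 35 + 150 + 35
fetch_limit_mhz = mhz_from_period(fetch_cycle_ns)

# Data path, which must fit in half a cycle because the ALU/multiplier
# registers load on the falling clock edge:
# address generator + data memory + set-up + crossbar.
half_cycle_ns = 40 + 150 + 20 + 20
data_cycle_ns = 2 * half_cycle_ns

# Clock actually used: 4.192 MHz crystal divided by two.
sequencer_clock_mhz = 4.192 / 2
actual_cycle_ns = period_from_mhz(sequencer_clock_mhz)
```

The fetch path alone would allow roughly 4.5 MHz, the data path requires a cycle of at least 460 nsec, and the divided crystal gives a period of about 477 nsec, in line with the figure quoted above and with a small margin over the 460 nsec requirement.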

4.1.5 Microprogram Memory

The present configuration of the cell requires 96 control bits. To maintain
modularity the microprogram memory is distributed over different PCBs, each
PCB containing 64 control bits. The present design has two such boards, called
microprogram memory bank 0 and bank 1.

The microprogram memory circuit is shown in Figures 4.7 to 4.9. It consists of
the HM6264LP-15 (8K x 8 bit SRAM), the transceiver 74LS245, the pipeline
register 74LS373 and a little other logic. The transceiver is used to connect
the memory to the PC data bus while downloading. The pipeline register's input
is connected to the memory data bits. The pipeline register is clocked by the
sequencer clock and is used to hold the control signals for one complete clock
cycle. This module of one memory chip, a transceiver and a pipeline register
is replicated eight times on each PCB, thereby yielding a total of 64 control
bits per microprogram memory bank.

The microprogram memory can be accessed either by the host or by the program
sequencer. In the first case, the program sequencer accesses the memory while
it is executing a microprogram. Before the start of microprogram execution,
the WCS bit in the cell control register is set to logic zero by the host. All
the memory chips in the different banks are permanently enabled and are in
read mode. The address lines MA0 to MA12 provide the address to the
microprogram memory, and the address lines MA13 to MA15 are ignored. Every
clock cycle the sequencer outputs an address. The contents of this memory
location are latched in the pipeline register and held for one complete clock
cycle.

In the second case the host accesses the microprogram memory to download the
microprogram. Before accessing the memory the host has to set the WCS bit to
logic high. The host data bus being 8 bits wide, the downloading has to be
done sequentially, one memory chip after the other. During microprogram
downloading the address lines MA13 to MA15 are decoded using a 3-to-8 line
decoder (74LS138) to provide a chip select to one of the eight memory chips at
any time (see Figures 4.8 and 4.9). The decoder itself is enabled by the
BANKx* control signal, which is activated by the host by setting the
appropriate bits in the bank select register. The same chip-select signals are
also used to enable the transceiver, thereby connecting the memory chip to the
host data bus.
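The use of the sequencer address during downloading can be pictured as follows. The split of the 16-bit address into a 13-bit location (MA0 to MA12) and a 3-bit chip number (MA13 to MA15) is taken from the text; the function names are hypothetical, and the sketch says nothing about which control bits each chip holds.

```python
LOCATION_BITS = 13    # MA0 to MA12 address one 8K x 8 SRAM
CHIPS_PER_BANK = 8    # MA13 to MA15 select one of eight chips


def split_download_address(ma: int):
    """Split a 16-bit sequencer address into (chip, location)."""
    if not 0 <= ma <= 0xFFFF:
        raise ValueError("address must fit in 16 bits")
    location = ma & ((1 << LOCATION_BITS) - 1)
    chip = ma >> LOCATION_BITS
    return chip, location


def download_addresses(location: int):
    """Addresses that reach the same microword location in all eight chips."""
    return [(chip << LOCATION_BITS) | location
            for chip in range(CHIPS_PER_BANK)]
```

Writing one full 64-bit microword of a bank therefore takes eight byte transfers from the host, one per chip, at the addresses returned by `download_addresses`.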
4.1.6 Synchronization Circuit

Whenever the host wants to write to the microprogram memory the
program sequencer provides the memory address. Similarly, while the host writes
to or reads from the data memory, the address generator provides the memory
address. Since the host may supply data at a rate different from the program
sequencer's clock rate, some handshake is required. This handshake signal should
assert the FLAG input of the program sequencer after each transfer. The
program sequencer will increment the address if, at the rising edge of the
clock, it finds a logic high at its FLAG input.

The circuit which generates this synchronizing flag (SFLAG) signal is
shown in Figure 4.10. As explained in Section 4.1.1, whenever the RDYEN* line
goes low the RDY line on the PC bus is pulled low, thereby introducing wait
states. Simultaneously, IOW* ANDed with RDYEN* is synchronized with the
sequencer clock using a D-type flip-flop. The synchronized host write signal
(SIOW*) is fed to two D-type flip-flops in cascade which act as a "one-shot".
The output of this one-shot is used as the host write signal (HWR*). This is
done to ensure that the address lines are stable during the rising edge of the
write pulse. The HWR* signal is delayed by half a clock cycle to generate the
SFLAG signal. The rising edge of the SFLAG signal also pulls the RDY line high,
thereby removing the wait state in the PC machine cycle. A similar arrangement
is also provided to generate the SFLAG during host read cycles. The timing
diagram for the synchronizing circuit is shown in Figure 4.11.

4.2 DATA CACHE

The cell provides for high intercell bandwidth by using two queues, X
and Y. To maintain a high bandwidth internal to the cell the functional units
are connected through a crossbar. In order to limit the traffic through the
crossbar, the usage of a register file and an auxiliary memory was suggested in
Sections 3.1.6 and 3.3.1 respectively.

The register file increases the communication bandwidth by using
multiport memory. The register file connection suggested in Section 3.3.2
provides i) easy flow of data between the ALU and the multiplier, ii) a delay
between the transfer of the output of the ALU to its input, and iii) buffering
for the data to be written to the data memory.

The auxiliary memory provides temporary storage for data with high
temporal locality. This feature will be useful when a single cell has to perform
the task of multiple cells. The auxiliary memory also holds frequently used
constants such as sine, cosine and reciprocal tables.

Due to the large number of control bits (68) required and the
difficulties encountered in assembling the register file chips, this module is
not implemented in the current design of the cell. Presently the ALU and the
multiplier are connected directly to the crossbar. This modification is
explained in the next section. This section describes a possible design for the
data cache module.

4.2.1 Register File

The ADSP-3128 is configurable via the DP control pin as either a
128 X 16 bit or a 64 X 32 bit register file. In the present design two such
chips are "paralleled" to yield 128 X 32 bits of storage. Each chip has five
data ports: A, B, C, D and E. Data ports A and B are write-only ports, data
ports C and D are read-only ports and the Edata-port is bidirectional. These
five data ports allow up to six data transfers per clock cycle.

Addresses presented on the Aadr, Badr and Eadr lines during clock high
are taken as the write addresses. Similarly, addresses presented on the Cadr,
Dadr and Eadr lines during clock low form the read addresses. The write
address latches can be made to latch the address during the clock high period
or be made transparent using the control pin Wadtrn. Similarly the read address
latches can be transparent or can register the read address on the clock's
rising edge. This is done by driving the Radtrn pin high and low respectively.

The register file allows two different types of write: normal and
flow-through. A normal write occurs only during the clock high period. The
Adata-port and Bdata-port input latches can be set to transparent, latched or
clock-on-falling-edge mode via the ABlt and ABht control pins. Similarly the
Edata-port input latch can be set to transparent, latched or
clock-on-falling-edge mode via the Elt and Eht control pins. These control pins
are always registered on the rising edge of the clock and become effective as
of the next falling edge. This mode allows up to a maximum of three reads and
three writes per clock cycle.

The write flow-through mode allows a write to RAM and a read from the
same RAM location in the clock low period. If no read address matches the write
address during flow-through mode, the write will not take place. Write
flow-through mode is enabled using the Awft, Bwft and Ewft control pins for the
A, B and E ports respectively. During this mode of operation the write address
latches should be made transparent and the read address latches can be either
transparent or registered. The write flow-through mode allows two flow-through
writes, one normal write and up to three reads, or one flow-through write, two
normal writes and up to three reads per clock cycle.

Each write port has an independent write inhibit control (Awinh, Bwinh
and Ewinh). These control pins should be driven high whenever a write operation
is not desired. The write ports are prioritized for both the write modes, with
the Edata-port having the highest priority, followed by the Adata-port,
followed by the Bdata-port. If writes to the same RAM location are attempted in
a given clock high period, the data presented at the highest priority enabled
data port will be written to RAM.
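
The write priority rule can be illustrated with a small behavioural sketch in
Python. It models only the arbitration described above (E over A over B, with
per-port write inhibit); the function and argument names are hypothetical and
no timing or latch behaviour of the actual chip is modelled.

```python
# Port priority, highest first, as stated in the text.
PRIORITY = ("E", "A", "B")


def resolve_writes(requests):
    """Return {address: data} for one clock high period.

    `requests` maps a port name to (address, data, inhibit); a port whose
    inhibit input is high does not take part in the write.
    """
    ram_updates = {}
    # Apply the lowest priority port first so that a higher priority port
    # writing to the same address overwrites it.
    for port in reversed(PRIORITY):
        if port not in requests:
            continue
        address, data, inhibit = requests[port]
        if inhibit:
            continue
        ram_updates[address] = data
    return ram_updates


if __name__ == "__main__":
    same_location = {
        "A": (5, 0xAAAA, False),
        "B": (5, 0xBBBB, False),
        "E": (5, 0xEEEE, True),   # E is inhibited, so A wins over B
    }
    print(resolve_writes(same_location))
```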

Read operation from the register file also has two modes: normal and
extra read. In a normal read three different locations can be read during the
clock low period using the C, D and E ports. The data latches for these ports
can be made transparent or set to register on the clock's rising edge. This is
done by driving the CDtran/Etran pin(s) high and low respectively. Extra reads
from the Cdata-port, Ddata-port and/or Edata-port can also be done during the
clock high period. Extra reads are unusual in that they occur during the clock
high period, which is normally a write phase. Therefore each extra read takes
place in lieu of one write operation to the register file. This feature is
enabled by making the output latches transparent, i.e., the Rfltran and
CDtran/Etran pins are driven high for the entire clock cycle. This mode allows
two 16 bit words to be read from each chip per cycle through the C, D and E
ports. The Aadr specifies the word to be read through the Cdata-port in the
clock high period while the Badr does the same for the Ddata-port. Normal read
operation allows a maximum of three reads during the clock low period. In the
extra read mode the register file allows four reads and two writes, five reads
and one write, or six reads per clock cycle. Each port also has its own
asynchronous three-state control: Ctri, Dtri and Etri.

All the control and address pins of both the register file chips are
connected together. The address lines of all the ports will be driven by the
control bits in the microprogram memory. As the ALU and the multiplier put
together have only three input ports, extra reads in a clock cycle will be of
no use. Therefore the control pin Rfltran is permanently grounded. Further, as
the Cdata-port and the Ddata-port drive only the input ports of the ALU and the
multiplier, they are enabled permanently by grounding the Ctri and Dtri pins.
The ES pin serves as a chip-select for the register file and is permanently
pulled high. The circuit diagram for the register file is shown in Figure 4.12.

4.2.2 Auxiliary Memory

The auxiliary memory is divided into two parts. One part of the
auxiliary memory is implemented with PROM (Am27PS281) and is used to hold
constants such as sine, cosine and reciprocal tables. The other part of the
auxiliary memory uses static RAM (6116), which serves as a scratch-pad memory.

The RAM part of the auxiliary memory is 2048 words long and each word
is 32 bits wide. This RAM is addressed using a separate address generator, an
ADSP-1410. The RAM is enabled whenever the AMS2 control bit is high. Separate
read and write control signals are provided from the microcode. The data port
of the address generator is connected to the data field of the microprogram
memory (see Figure 4.12). This allows for initializing the internal registers
of the address generator. The address generator gets its instruction from the
microprogram memory. Figure 4.12 gives the complete circuit diagram for this
part of the auxiliary memory.

The PROM part of the auxiliary memory is 1024 words long and holds the
constant tables. The first 256 locations hold the sine table, the next 256
locations hold the cosine table, the next 256 locations hold the reciprocal
table and the last 256 locations are not used presently. All these values are
stored in single-precision floating-point format with the least significant 15
bits of the mantissa set to zero. Unlike the RAM portion of the auxiliary
memory, the PROM portion is only 16 bits wide and is split into two parts: one
for the exponent and one for the mantissa. The two portions of the PROM are
addressed separately using latches whose inputs are connected to the outputs of
the ALU and the multiplier through a multiplexer. One latch is used to latch
the exponent part of the 32 bit floating-point number and addresses the
exponent PROM. The other latch latches the eight most significant bits of the
mantissa and is used to address the mantissa PROM. The remaining bits of the
mantissa are filled with zeros. The sign bit is passed on unmodified. Figure
4.13 illustrates how this mapping is achieved.

To select the sine, cosine or reciprocal table from the PROM, two
more control bits (AMS1 and AMS0) are used. The AMS1 control bit is connected to
the A9 address pin of the PROM and the AMS0 control bit to the A8 address
pin. The table selected depends upon these two control bits, as listed below.


AMS2   AMS1   AMS0   Usage

 0      0      0     Sine table selected
 0      0      1     Cosine table selected
 0      1      0     Reciprocal table selected
 0      1      1     Both PROM and RAM are deselected
 1      0      0     RAM selected
 1      0      1     RAM selected
 1      1      0     RAM selected
 1      1      1     RAM selected

A detailed circuit diagram of this is given in Figure 4.14.
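
The selection table and the address mapping of the constant PROM can be
summarised in a short Python sketch. It is an illustration only: the function
names are hypothetical, and the bit positions assume the usual IEEE 754
single-precision layout (sign in bit 31, exponent in bits 30 to 23, mantissa in
bits 22 to 0), which the text does not spell out.

```python
import struct


def select_auxiliary_memory(ams2, ams1, ams0):
    """Return which part of the auxiliary memory the control bits select."""
    if ams2:
        return "RAM"
    tables = ("sine table", "cosine table", "reciprocal table", "none")
    return tables[(ams1 << 1) | ams0]


def prom_addresses(word, ams1, ams0):
    """Return (sign, exponent PROM address, mantissa PROM address).

    `word` is the 32-bit pattern of the operand. AMS1 drives A9 and AMS0
    drives A8; the low eight address bits come from the operand itself.
    """
    sign = (word >> 31) & 1                 # passed on unmodified
    exponent = (word >> 23) & 0xFF          # addresses the exponent PROM
    mantissa_msbs = (word >> 15) & 0xFF     # eight MSBs of the mantissa
    table_base = (ams1 << 9) | (ams0 << 8)
    return sign, table_base | exponent, table_base | mantissa_msbs


if __name__ == "__main__":
    bits = struct.unpack(">I", struct.pack(">f", 1.5))[0]
    print(select_auxiliary_memory(0, 0, 1))
    print([hex(v) for v in prom_addresses(bits, ams1=0, ams0=1)])
```

Because only eight mantissa bits take part in the lookup, each table has 256
entries, which matches the 256-location tables described above.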



Figure 4.13 Constant table address mapping

4.3 NUMBER-CRUNCHING UNIT


The number-crunching unit consists of the ALU, multiplier, data memory,
X queue, Y queue and the crossbar circuitry. The ALU and the multiplier are
mounted on a PCB. The crossbar, data memory, X queue and Y queue are assembled
on two wire-wrap boards.

4.3.1 ALU and Multiplier

The ALU used in this cell is the ADSP-3220. The ALU has two input ports
(AIN and BIN) and one output port (DOUT), all 32 bits wide. Inside the ALU there
are 8 registers, 4 of which can be loaded from the AIN port and the other 4 from
the BIN port. Eight control pins are provided which, when asserted, load the
corresponding registers with the data at the AIN/BIN ports. The register
selection for loading the operands is done using control bits 81 to 88. As
shown in Figure 4.15 the BIN port of the ALU can be loaded directly from the
crossbar output or from the output of the multiplier through a multiplexer.
This input selection is done using control bit 89. The ALU has a 9 bit wide
instruction port which is fed from control bits 72 to 80. Along with the
instruction the user has to specify which registers contain the operand(s).
This is done by using the RDA0, RDA1, RDB0 and RDB1 pins of the ALU. These pins
are controlled by control bits 90 to 93. The round mode control pins RND0 and
RND1 are permanently connected to give round-to-nearest mode operation. For
some operations the output can be 64 bits wide. The lower or upper 32 bits of
the result can be output on the DOUT port by using control bit 94, which is
connected to the MSWSEL pin of the ALU.

The cell uses the ADSP-3210 multiplier. This multiplier has a single
input port (DIN) and one output port (DOUT). There are 4 registers inside the
multiplier which can be loaded from the DIN port. The register(s) to be loaded
is selected by asserting the SELA0, SELA1, SELB0 and SELB1 pins. These pins are
connected to control bits 64 to 67. Unlike the ALU, the multiplier initiates
one multiplication every clock cycle and has no instruction set. The operand
register is selected using control bits 68 and 69, which are connected to the
RDA0 and RDB0 pins of the multiplier. Control bit 70, connected to the SP pin
of the multiplier, selects the type of operand, i.e., two's complement or 32
bit floating-point number. As in the ALU, control bit 71 selects the least
significant or the most significant word of the product.

4.3.2 X Queue and Y Queue

The FIFO (IDT7202A) used for the X and Y queues is 1024 words long and
9 bits wide. The queue width is increased to 32 bits by connecting together the
input control signals of four such chips. The FIFO has four input control
signals and three queue status signals (Figure 4.16).

The input control signals are Read (R*), Write (W*), ReSet (RS*) and
ReTransmit (RT*). The read and write signals are used for reading from and
writing to the queues. The RS* signal is used to reset the FIFO. Asserting this
signal resets both the read and write pointers to the first location. The
retransmit control signal, when asserted, sets the internal read pointer to the
first location and does not affect the write pointer. This feature will be
useful when the same data is to be used a number of times.

The three status signals are Full Flag (FF*), Empty Flag (EF*) and
Half-full Flag (HF*). The FF* goes low whenever the write pointer is one
location from the read pointer, or when the read pointer has not moved and 1024
writes have taken place. This flag is passed on to the adjacent cell(s) to
indicate the queue status. The EF* goes low, inhibiting further read
operations, when the read pointer is one location from the write pointer,
indicating that the queue is empty. This signal is used by the sequencer in the
cell while reading from the queue. The HF* output is asserted and remains so
until the difference between the write pointer and the read pointer is less
than or equal to one half of the FIFO size, i.e., 512. This signal is not used
in the present design.
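
The control and status signals described above can be captured in a small
behavioural model. The Python sketch below is an assumption-laden illustration,
not a model of the device timing: the class and attribute names are
hypothetical, flags are returned as True when asserted (the hardware signals
are active low), and retransmit is assumed to be used only before the write
pointer has wrapped around.

```python
class QueueModel:
    """Behavioural sketch of one FIFO queue (default depth 1024 words)."""

    def __init__(self, depth=1024):
        self.depth = depth
        self.reset()

    def reset(self):
        """RS*: both pointers return to the first location."""
        self.cells = [None] * self.depth
        self.read_ptr = 0
        self.write_ptr = 0
        self.count = 0     # words written but not yet read
        self.writes = 0    # writes since the last reset

    def write(self, word):
        """W*: store one word unless the full flag is asserted."""
        if self.full:
            return False
        self.cells[self.write_ptr] = word
        self.write_ptr = (self.write_ptr + 1) % self.depth
        self.count += 1
        self.writes += 1
        return True

    def read(self):
        """R*: return the next word, or None when the queue is empty."""
        if self.empty:
            return None
        word = self.cells[self.read_ptr]
        self.read_ptr = (self.read_ptr + 1) % self.depth
        self.count -= 1
        return word

    def retransmit(self):
        """RT*: read pointer back to the start, write pointer untouched."""
        if self.writes > self.depth:
            raise RuntimeError("model assumes no write-pointer wrap-around")
        self.read_ptr = 0
        self.count = self.writes

    @property
    def full(self):
        return self.count == self.depth

    @property
    def empty(self):
        return self.count == 0

    @property
    def half_full(self):
        return self.count > self.depth // 2
```

After a retransmit the same words can be read again from the first location,
which is the property the text relies on when data is reused several times.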

As shown in Figure 3.2 the X queue of cell_i can be written by cell_{i-1}
or by cell_i itself. The multiplexer provided at the input does this
multiplexing. Similarly the Y queue of cell_i can be written by cell_{i-1},
cell_{i+1} or cell_i itself. A 4-to-1 line multiplexer at the input of the Y
queue multiplexes all these inputs. The outputs of both the queues are
connected to the crossbar inputs, which feed into the ALU and the multiplier.

Non-availability of the FIFOs has forced us to use a TTL register file
(74170) temporarily. This register file is organized as four words of 4 bits
each. Eight such chips are used to form a 32 bit wide queue (see Figure 4.17).

Four data inputs are available which are used to supply the 4 bit word
to be stored. The location of the word is determined by the write address in
conjunction with the write enable signal. When the read address is applied in
conjunction with the read enable signal, the word appears at the four outputs.
Separate on-chip decoding is provided for read and write. This permits
simultaneous reading from and writing to the same location.

The outputs of the X queue and the Y queue are connected to the
crossbar. The inputs to the X queue and the Y queue are connected to the outputs
of the ALU and the multiplier through a 2-to-1 line multiplexer (Figure 4.21).
This enables loading of data into the X queue and the Y queue without using the
interface unit. When the interface unit and more cells are added, this input
multiplexer will have to be modified accordingly.

4.3.3 Data Memory

The data memory circuit is shown in Figures 4.18 and 4.19. It consists
of four HM6264LP-15 8K-by-8 SRAMs, which form a 32 bit wide data memory. Also
connected to the data memory are four transceivers (74LS245) which allow
connection of the 8 bit PC data bus to these memory chips, so that the host can
read from or write to the data memory. The data memory can also input data from
the ALU and the multiplier through a 2-to-1 line multiplexer (Figure 4.21).
Also included is some logic which allows the host and the cell to access the
data memory without conflict. This logic is also used to multiplex the read and
write signals of the host and the cell.

The signals which control the sharing of the data memory by the host
and the cell are DMEN (control bit 45) and DDL (control bit 44).

When the cell accesses the data memory (DMEN=1 and DDL=0), all the
memory chips are enabled simultaneously during a read or write cycle. During
the cell access the read and write signals are obtained from control bits 42
and 43 respectively. As the size of the data memory is only 8K, address lines
DMA0 to DMA12 are used and the higher address bits are ignored.

When the host accesses the data memory (DMEN=0 and DDL=1), the address
lines DMA13 to DMA15 are decoded using a 3-to-8 line decoder to provide a chip
select to one memory chip at a time. The same signal is also used to enable the
transceiver, so that one memory chip is connected to the host data bus. Before
the host accesses the data memory it has to download and execute a
microprogram which initializes the address generator. The flow chart for this
program is shown in Figure 4.20. Here the synchronization between the host
read/write and the sequencer clock is again achieved using the SFLAG signal, as
explained in Section 4.1.6. The program sequencer does not increment its
program counter unless it senses a logic high at its FLAG input. During this
time the address generator simply outputs the address held in register R0. When
the FLAG input is asserted the program sequencer executes the next instruction
in the next clock cycle. In this clock cycle the address generator increments
the contents of the R0 register by one, while the sequencer loops back to where
it again waits for the FLAG input to be asserted.
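
The two-instruction wait loop described above can be traced with a simple
cycle-by-cycle sketch in Python. The function name and the trace format are
hypothetical; the model assumes that FLAG is sampled once per clock cycle while
the sequencer is waiting and is ignored during the single increment cycle.

```python
def simulate_host_access(flag_per_cycle, r0=0):
    """Trace the wait loop executed while the host accesses the data memory.

    `flag_per_cycle` holds the FLAG level sampled in each clock cycle.
    Returns the final R0 value and a list of (state, address on the bus).
    """
    trace = []
    incrementing = False
    for flag in flag_per_cycle:
        if incrementing:
            # Next instruction: the address generator increments R0 and the
            # sequencer jumps back to the wait instruction.
            trace.append(("increment", r0))
            r0 += 1
            incrementing = False
        else:
            # Wait instruction: R0 is driven on the address lines and the
            # sequencer advances only if FLAG is high.
            trace.append(("wait", r0))
            incrementing = bool(flag)
    return r0, trace


if __name__ == "__main__":
    final_r0, trace = simulate_host_access([0, 0, 1, 0, 0, 1, 0])
    for cycle, (state, address) in enumerate(trace):
        print(cycle, state, address)
    print("R0 =", final_r0)
```

In this model each host transfer costs at least two sequencer cycles, one in
which FLAG is seen and one in which R0 is incremented.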

4.3.4 Crossbar

The crossbar is used to interconnect the ALU, multiplier, data memory
and the queues (see Figure 4.21). The crossbar is assembled using 32 dual 4-to-1
multiplexers. (Alternatively the crossbar can also be implemented with PALs or
digital crosspoint switches. Crosspoint switches have a fixed number of ports,
all of which may not be needed in some applications. Further, many such chips
will have to be "paralleled" to obtain the required bus width. On the other
hand, PALs can be programmed for the required interconnection.) From the
hardware point of view the crossbar can be separated into two different units
called xbar1 and xbar2. Each of these crossbars has four input ports and one
output port. The outputs of the X queue, the Y queue and the data memory are
connected to three different input ports of both the crossbars.

The output port of xbar1 is connected directly to the AIN input port of
the ALU. The output port of xbar2 is connected to the DIN input port of the
multiplier and also to the BIN input port of the ALU through a multiplexer.

As the fourth input port in both the crossbars is not needed by the
queues and the data memory, it is utilized to provide some constant literal
inputs. An analysis of the distribution of constants [5] indicates that 0 and 1
are the most frequently used ones. The fourth input of xbar1 holds the value 0
and that of xbar2 has the value 1.

Routing of the inputs to the output is done by two control signals for
each of the two crossbars. Control bits 58 and 59 select the output of xbar1 and
similarly control bits 60 and 61 select the output of xbar2.
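
The control bit assignments quoted in Section 4.3 can be gathered into a
single field table. The Python sketch below packs named fields into one
horizontal microinstruction word. The bit numbers are those given in the text;
the field names, the convention that the lowest numbered bit of a field is its
least significant bit, and the helper itself are assumptions made for
illustration and do not describe the meta-assembler actually used.

```python
# (lowest control bit, width in bits), taken from the bit numbers in the text.
FIELDS = {
    "dm_read": (42, 1),
    "dm_write": (43, 1),
    "ddl": (44, 1),
    "dmen": (45, 1),
    "xbar1_select": (58, 2),
    "xbar2_select": (60, 2),
    "mul_load_select": (64, 4),   # SELA0, SELA1, SELB0, SELB1
    "mul_operand_select": (68, 2),  # RDA0, RDB0
    "mul_sp": (70, 1),
    "mul_word_select": (71, 1),
    "alu_instruction": (72, 9),
    "alu_load_select": (81, 8),
    "alu_bin_source": (89, 1),
    "alu_operand_select": (90, 4),  # RDA0, RDA1, RDB0, RDB1
    "alu_mswsel": (94, 1),
}


def pack_control_word(**values):
    """Pack the given field values into one integer control word."""
    word = 0
    for name, value in values.items():
        low_bit, width = FIELDS[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bit(s)")
        word |= value << low_bit
    return word


if __name__ == "__main__":
    word = pack_control_word(dmen=1, dm_read=1, xbar1_select=2,
                             alu_instruction=0x055)
    print(f"{word:024x}")
```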
5. CELL PROGRAMMING


Hand coding of a microcode-based system with a large number of control
bits is a taxing job. Appropriate software tools such as a meta-assembler are a
must before a user can program such a system. This chapter describes the
instruction set of the SASP cell, followed by the software utilities developed
for using the cell. Finally this chapter concludes with an example of a
matrix-matrix multiplication program for the cell.

5.1 CELL INSTRUCTION SET

As in any microcoded system, an instruction word of the SASP cell is
split into multiple fields. Each field controls a different functional block in
the cell. To reduce the cycle time the microcode is organised horizontally,
without any further decoding of the control bits. This approach has resulted in
a wide microcode memory (96 bits) and a large set of opcodes. Not all of these
opcodes are valid instructions.

The current implementation has one 16-bit wide data field and 23
instruction fields. These instruction fields define 220 basic instructions.
Appendix 3 contains a list of the mnemonic, the opcode and a brief description
of every valid instruction. The listing for each field starts with the field
name and the default instruction for the field, followed by all the
instructions for that field.

5.2 SOFTWARE UTILITIES


A few utility programs have been developed which are used to program
and debug the hardware. A brief description of each utility program is given
below along with the program name.

MEASM.EXE  This program, a meta-assembler developed by Sanjay [22], is
used to assemble the microprograms. It is a two-pass assembler. In the first
pass it processes a file containing the definition of the field format of an
instruction word in the cell. In the second pass it processes the user program,
translating the mnemonics, input symbols, etc., into a sequence of binary
microinstructions. For fields in which the user has not specified any
instruction, the meta-assembler automatically includes the corresponding
default instruction. The output of the meta-assembler is the microprogram in
object code form.

ARRMC.COM  The output of the meta-assembler is a string of 1's and 0's.
The length of this string equals the number of control bits in use. Before
downloading the microprogram to the cell the user has to rearrange this OBJ file
in columnar fashion, so that different columns of the object code reside in
different memories in the microprogram memory bank. This rearrangement of the
object code is done by this program. The inputs to this program are the OBJ
file and the number of microcode memory chips. The program then creates the
output files MCBx.DAT, where x equals 0, 1, etc. MCB0.DAT contains the
microcode which is to be downloaded to BANK0, MCB1.DAT contains the microcode
which is to be downloaded to BANK1, and so on.
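
The columnar rearrangement can be sketched as follows. This Python fragment is
not the ARRMC.COM program; it is an illustrative reconstruction that assumes
each microinstruction is given as a string of 0's and 1's whose length is a
multiple of eight, that the string is cut from left to right into 8-bit
columns, and that eight columns (chips) make up one bank. The function name and
the in-memory result format are hypothetical.

```python
def rearrange_microcode(microwords, chips_per_bank=8):
    """Return banks[bank][chip] as bytes, one byte per microinstruction."""
    width = len(microwords[0])
    if width % 8 or any(len(word) != width for word in microwords):
        raise ValueError("all words must share a width that is a multiple of 8")
    # Column c collects bits c..c+7 of every microinstruction, i.e. the
    # contents of one 8-bit wide memory chip.
    columns = [
        bytes(int(word[c:c + 8], 2) for word in microwords)
        for c in range(0, width, 8)
    ]
    # Group the per-chip columns into banks of `chips_per_bank` chips.
    return [
        columns[b:b + chips_per_bank]
        for b in range(0, len(columns), chips_per_bank)
    ]


if __name__ == "__main__":
    program = ["0" * 88 + "10101010", "1" * 96]
    banks = rearrange_microcode(program)
    for bank_number, bank in enumerate(banks):
        print("bank", bank_number, [column.hex() for column in bank])
```

With 96-bit words this yields twelve columns, so the first bank is full and
the second holds the remaining four chips.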

ARRDM.COM  The data memory of the cell is 32 bits wide while that of the
host (PC/XT) is 8 bits wide. Transfer of data to the cell from the host has to
be demultiplexed as described in Section 4.3.3. So before transferring, the
data has to be formatted in a columnar manner. This is done by ARRDM.COM. The
output of this program is the DM.IN file, which can be transferred to the cell
as described in Section 4.3.3.


RDDMTF.COM  The host, while reading the data from the data memory,
reads 8 bits at a time. After reading the data from the four memory chips of the
data memory, these data should be collated to form 32 bit data. This program
first reads the entire 8192 bytes from each of the four memory chips in the data
memory and then collates the first 512 data only. (As the full data memory was
not used by any program, only 512 data were collated. This limitation can be
removed by changing a parameter in the source file.) The 32 bit data is sent to
the output file DM.OUT.
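
The collation step can be illustrated with a few lines of Python. This is a
sketch rather than the RDDMTF.COM source: the function name is hypothetical,
and it assumes that chip 0 holds the least significant byte of each 32 bit
word, which the text does not specify.

```python
def collate_words(chip_dumps, count=512):
    """Combine four per-chip byte dumps into `count` 32-bit words.

    `chip_dumps` is a sequence of four bytes objects, one per memory chip.
    Assumption: chip 0 holds the least significant byte of each word.
    """
    if len(chip_dumps) != 4:
        raise ValueError("expected dumps from exactly four memory chips")
    words = []
    for index in range(count):
        word = 0
        for chip, dump in enumerate(chip_dumps):
            word |= dump[index] << (8 * chip)
        words.append(word)
    return words


if __name__ == "__main__":
    dumps = [bytes([0x78, 0x01]), bytes([0x56, 0x00]),
             bytes([0x34, 0x00]), bytes([0x12, 0x00])]
    print([hex(word) for word in collate_words(dumps, count=2)])
```

The `count` argument plays the role of the source-file parameter mentioned
above; setting it to 8192 would collate the whole data memory.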

CC.COM  This program acts as a controller for the cell. Running on the
host, this program can be used for the following purposes:

i) To download the microprogram, i.e., transfer the MCBx.DAT files to the
cell.

ii) To download the data, i.e., transfer the DM.IN file to the data memory.

iii) To reset/halt the sequencer.

iv) To start execution of a microprogram.

v) To fill different microcode test patterns in the microprogram memory
(this feature was used while debugging the hardware in the
microengine and the microprogram memory bank cards).

CELL.DAT  This file defines the field formats, the field positions in the
microinstruction and the valid instructions in each field for the cell
currently in use. This is a non-executable file and is used by MEASM.EXE during
the definition processing phase.

5.3 MATRIX-MATRIX MULTIPLICATION: AN EXAMPLE

Consider two matrices A and B of size M X N and N X K respectively.
Their product C (=AB) is a matrix of size M X K. The program for this is given
below. The program starts by initializing the address generator's registers R0
and R1. Register R0 points to the elements of the B array, which is stored in
row-major order in the data memory. As the IFU is not used currently, the
elements of the A matrix are initially stored in the data memory. Before
starting the computations the elements of the A matrix are moved from the data
memory to the X queue through the ALU. This was done to create the systolic
effect without using the IFU. The X queue holds the elements of the A matrix in
column-major order. The partial results are stored in the Y queue. Final
results are stored both in the Y queue and in the data memory. Register R1
points to the location where the final results will be stored in column-major
order. After completing the execution the program sequencer waits for the host
to read the results from the data memory.

As both the queues are four words deep, the size of the A and B
matrices is limited to 2 X 2. This matrix-matrix multiplication involves 8
multiplications and 4 additions, and the number of clock cycles taken to
complete them equals 15. Figure 5.1 shows the detailed activities for these 15
clock cycles. Since the matrices are small in size, almost half the clock
cycles are lost in filling and draining the pipe. From this figure we can see
that starting from the fifth clock cycle the multiplier outputs one product
every cycle, and similarly the ALU outputs the sum of two operands every clock
cycle starting from the eighth clock cycle. As another example, consider the
multiplication of A and B matrices of size 10 X 10 each. This involves 1000
multiplications and 900 additions, which can be completed in 1007 clock cycles
if the same type of pipelining as shown in Figure 5.1 can be achieved. Then at
the present clock frequency of 2.096 MHz the computation rate is approximately
4 MFLOPS.
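
The operation counts and the quoted computation rate follow from simple
arithmetic, which the short Python sketch below reproduces. The function names
are hypothetical; the cycle counts (15 and 1007) and the clock frequency are
the figures given in the text.

```python
def matmul_op_counts(m, n, k):
    """Multiplications and additions for an (m x n) by (n x k) product."""
    multiplications = m * n * k
    additions = m * (n - 1) * k
    return multiplications, additions


def mflops(multiplications, additions, cycles, clock_mhz):
    """Floating-point operations per microsecond, i.e. MFLOPS."""
    return (multiplications + additions) / cycles * clock_mhz


if __name__ == "__main__":
    for size, cycles in ((2, 15), (10, 1007)):
        mults, adds = matmul_op_counts(size, size, size)
        rate = mflops(mults, adds, cycles, clock_mhz=2.096)
        print(f"{size} x {size}: {mults} mult, {adds} add, {rate:.2f} MFLOPS")
```

For the 10 X 10 case the rate comes out just under 4 MFLOPS, since 1900
operations are spread over 1007 cycles, i.e. close to two operations per cycle.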

, matrix-matrix multiplication program listing
start
    cont & rst clrcntr & amres
    disir & dti(i0) & dsel & dbenb & 3 7
    dti(i1) & dsel & dbenb & 3 16
    itr(i0)(r0)                     , initialize R0 register
    itr(i1)(r1)                     , initialize R1 register
    dtcr & dbenb & dsel & 3 0

, transfer the elements of A matrix to X queue
    litxb1 & laa0r & yrtr(r0) & dmtxb           , A0 = "0 0"
    dmtxb & dmtxb2 & lab2rx & yrtr(r0)          , B2 = (R0)
    passa & aor[b2a0]
    passb & aor[b2a0]
    passa & aor[b2a0]
    passb & aor[b2a0] & atyq(w0)
    passa & aor[b2a0] & atxq(w0)
    passb & aor[b2a0] & atyq(w1)
    passa & aor[b2a0] & atxq(w1)
    passb & aor[b2a0] & atyq(w2)
    atxq(w2)
    atyq(w3)
    atxq(w3) & amres

, start matrix multiplication
    itr(i0)(r0)
    enbcntr & lmra0b0 & xqtxb(r0) & xqtxb2 & yinc(c0)(r0)
    enbcntr & dmtxb & dmtxb2 & flpm[b0a0]
    enbcntr & lmra1b1 & xqtxb(r1) & xqtxb2 & yinc(c0)(r0) & flpm[b0a1]
    enbcntr & dmtxb & dmtxb2 & yqtxb(r0) & yqtxb1 & flpm[b1a0] &
        laa0r & lab1rm
    enbcntr & lmra0b0 & xqtxb(r2) & xqtxb2 & yqtxb(r1) & yqtxb1 & noset &
        yinc(c0)(r0) & flpm[b1a1] & laa2r & lab3rm & aor[b1a0] & sadd
    enbcntr & dmtxb & dmtxb2 & yqtxb(r2) & yqtxb1 & flpm[b0a0] & noset &
        laa0r & lab1rm & aor[b3a2] & sadd
    enbcntr & lmra1b1 & xqtxb(r3) & xqtxb2 & yqtxb(r3) & yqtxb1 &
        yinc(c0)(r0) & flpm[b0a1] & laa2r & lab3rm & aor[b1a0] & sadd
    enbcntr & dmtxb & dmtxb2 & yqtxb(r0) & yqtxb1 & flpm[b1a0] &
        laa0r & lab1rm & aor[b3a2] & sadd & atyq(w0)
    enbcntr & yqtxb(r1) & yqtxb1 & flpm[b1a1] & laa2r & lab3rm &
        aor[b1a0] & sadd & atyq(w1)
    enbcntr & yqtxb(r2) & yqtxb1 & laa0r & lab1rm & aor[b3a2] &
        sadd & atyq(w2)
    enbcntr & yqtxb(r3) & yqtxb1 & laa2r & lab3rm & aor[b1a0] &
        sadd & yinc(c0)(r1) & atyq(w3)
    enbcntr & aor[b3a2] & sadd & yinc(c0)(r1) & atdm
    enbcntr & yinc(c0)(r1) & atdm
    enbcntr & yinc(c0)(r1) & atdm
    enbcntr & yinc(c0)(r1) & atdm

, wait for host to read the results
    yrtr(r0) & dbenb & 3 0 & dsel & hrw
wait_for_host
    jpcnf & flag? & yrtr(r0) & noset & hrw
    jda(unconditional) wait_for_host & dbenb & yinc(c0)(r0) & hrw
    jda(unconditional) wfh & dbenb
6. CONCLUSIONS


A systolic array is a cost-effective solution for compute-bound
problems. It combines the high throughput of an array processor with pipelined
intercell communication. Systolic arrays are ideal for fine-grain and
large-grain problems. Such a systolic array (SASP) was defined by Nemawarkar
[19], and modified further by Samit [21] and Usman [1'’] for signal processing
applications. SASP is a linear array of microprogrammable cells connected to
the external host (PC/XT) through an interface unit.

In this thesis the cell architecture has been laid out in greater detail
and a major part of it has been implemented using SSI/MSI chips and a few
special-purpose ICs. The cell uses 32 bit wide data paths throughout. A
pipelined floating-point ALU and multiplier are used to provide a higher
processing rate and a wide dynamic range for data. Currently the cell operates
at a much lower frequency than its rated frequency of 10 MHz, which would be
achievable provided fast memories are used. Various functional blocks of the
cell were tested using different microprograms. Software utilities were also
developed to ease the task of programming the cell.

The small size of the X and Y queues has been a limiting factor in
trying out different algorithms. Operating at a clock frequency of 2.096 MHz,
the maximum throughput of the cell can reach up to 4 MFLOPS. A higher
processing rate can be achieved by increasing the clock frequency and using
high-speed memories.



SUGGESTIONS FOR FURTHER WORK


i) A 32 bit IFU has been developed in parallel [23]. The next step should
be to interface the cell and the IFU. With the IFU the limited size of the X and
Y queues can be circumvented and the full capability of the cell can be
exploited in a true systolic manner for a variety of applications.

ii) More algorithms should be developed and executed on this one-cell
systolic array to find the limitations of the current cell architecture. Based
on the experience gained, the cell architecture may be refined and the cell
should be replicated to form a large array.

iii) A compiler which can hide the low-level details of the hardware and
let the user work at a much higher level should be developed.

iv) High-speed memories and FIFOs should be incorporated so that the
cell can be operated at its maximum frequency.
APPENDIX 1


PROBLEM PARTITIONING TECHNIQUES 


Clock skew puts an upper limit on the number of cells that can be
incorporated in an array. When the number of cells in the array is less than the
number of PEs in the algorithm, it is necessary to carry out an additional
space-time mapping. The need for space-time mapping can also be justified for
the reason described below.

The time required to compute the N components of A = BX + C, where B is an 
N-by-M matrix and X and C are column vectors with dimensions M and N respectively, 
is of O(2N). Similarly, the time to compute the components of D = EF + G, where 
E, F and G are respectively M-by-N, N-by-P and M-by-P matrices, is of O(3N) [7]. 
For matrix-vector multiplication, every cell is active only in alternate time 
intervals, and for matrix-matrix multiplication on the hexagonal array, in any row 
or column, out of any three consecutive PEs only one is active at any given time. 
If we can perform the operations of multiple cells on one PE, a substantial saving 
in hardware can be achieved. 

There are two types of spatial mapping: coalescent and cut-and-pile [13]. 

1) Coalescent mapping: In this mapping, $\mathrm{cell}_i$, for $1 \le i \le c$, 
is mapped to $\mathrm{PE}_k$, where $k = \lceil i / \lceil c/L \rceil \rceil$ for 
$1 \le k \le L$, and where $L$ is the number of PEs available. In coalescent 
mapping (Figure A1.1), $\lceil c/L \rceil$ consecutive cells are assigned to one 
PE, so the PE requires feedback links to itself. Each data item that has entered 
a PE remains in it for $\lceil c/L \rceil$ cycles. The drawback of this mapping is 
that the computation load is not uniformly distributed among the PEs. 

2) Cut-and-pile mapping: This mapping maps $\mathrm{cell}_i$ to $\mathrm{PE}_k$, 
where $k = 1 + (i-1) \bmod L$ (Figure A1.2). Each PE is functionally equivalent to 
a cell in the array. One feedback link between the first and last PEs is required. 
The feedback loop should have memory of size proportional to $\lceil c/L \rceil$. 
This mapping distributes the computing load evenly among the PEs, and therefore 
the computing time is smaller. 
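
The two mappings can be compared numerically. The following is a minimal sketch, 
assuming 1-based cell and PE indices and the index formulas k = ceil(i / ceil(c/L)) 
for coalescent mapping and k = 1 + (i-1) mod L for cut-and-pile mapping. The 
function names are illustrative only and do not come from the thesis software. 

```python
from collections import Counter


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling of a / b for positive integers."""
    return -(-a // b)


def coalescent_pe(i: int, c: int, L: int) -> int:
    """PE index (1-based) of cell i when ceil(c/L) consecutive cells share one PE."""
    return ceil_div(i, ceil_div(c, L))


def cut_and_pile_pe(i: int, c: int, L: int) -> int:
    """PE index (1-based) of cell i when cells are dealt out to the PEs cyclically."""
    return 1 + (i - 1) % L


def load_per_pe(mapping, c: int, L: int) -> list:
    """Number of cells assigned to each of the L PEs for an array of c cells."""
    counts = Counter(mapping(i, c, L) for i in range(1, c + 1))
    return [counts.get(k, 0) for k in range(1, L + 1)]


if __name__ == "__main__":
    # Example: 10 algorithm cells mapped onto 4 physical PEs.
    print(load_per_pe(coalescent_pe, 10, 4))    # [3, 3, 3, 1]
    print(load_per_pe(cut_and_pile_pe, 10, 4))  # [3, 3, 2, 2]
```

For c = 10 and L = 4, the coalescent mapping leaves the last PE with a single cell, 
while cut-and-pile keeps the per-PE load within one cell of uniform. 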



PE1 PE2 PE3 PE4 


Figure A1.1 A linear systolic array with coalescent mapping 



PE1 PE2 PE3 PE4 


Figure A1.2 A linear systolic array with cut-and-pile mapping 










APPENDIX 2 

CONTROL SIGNALS AND THEIR FUNCTIONS 


This appendix lists the control signals assigned to the different bits of 
the microprogram memory word and briefly describes their functions. The control 
bit numbers are also listed to help the user in hardware debugging. 


Bit no.     Signal name     Function 

0 to 6      SI0-SI6         Program sequencer instructions 
7           DBEN            Enables the output of the mux connected to the data 
                            port of the program sequencer and the address 
                            generator 
8 to 23     DATA            These 16 bits hold the data which can be loaded in the 
                            internal registers of the program sequencer and the 
                            address generator 
24 to 33    AG0-AG9         Address generator instructions 
34          DSEL            Asserting this signal causes the data set up on the 
                            data port of the address generator to be transferred 
                            to the register specified in the instruction 
35          CDXFER          Enables the latch which facilitates the transfer of 
                            data between the sequencer and the address generator 
36 to 38    FS0-FS2         Selects one of the eight flag inputs to be passed on to 
                            the FLAG input of the sequencer 
39          SETST           Set cell status (this signal should be used as a 
                            chip-select for the cell status register, but 
                            currently is being used to drive an LED) 
40          DMIMUXOE*       Enable the output of the multiplexer which passes on 
                            the output of the ALU/multiplier to the data memory 
41          DMIA*/M         Select ALU or multiplier output to be written into the 
                            data memory 
42          DMCWR           Data memory write signal from cell 
43          DMCRD           Data memory read signal from cell 
44          DDL             Set data memory for host download/upload 
45          DMEN            Enable all data memory simultaneously for cell access 
46          XQIA*/M         Select ALU or multiplier output to be written into the 
                            X queue 
47 to 48    XQWA,XQWB       X queue write address 
49          XQWR*           X queue write signal from cell 
50 to 51    XQRA,XQRB       X queue read address 
52          YQIA*/M         Select ALU or multiplier output to be written into the 
                            Y queue 
53 to 54    YQWA,YQWB       Y queue write address 
55          YQWR*           Y queue write signal from cell 
56 to 57    YQRA,YQRB       Y queue read address 
58 to 59    X1S0,X1S1       Xbar1 input select signals 
60 to 61    X2S0,X2S1       Xbar2 input select signals 
62          LOADCNTR        Sets the clock counter to zero 
63          ENBCNTR         Enables the counting operation of the counter 
64          MSELA0          Load multiplier's A0 register from DIN port 
65          MSELA1          Load multiplier's A1 register from DIN port 
66          MSELB0          Load multiplier's B0 register from DIN port 
67          MSELB1          Load multiplier's B1 register from DIN port 
68 to 69    MRDA0,MRDB0     Selects the operand registers for multiplication 
70          MSP             Selects floating-point or fixed-point multiplication 
71          MMSWSEL         Selects the msw/lsw of the multiplier output 
72 to 80    ALUI0-ALUI8     ALU instructions 
81          ASELA0          Load ALU's A0 register from AIN port 
82          ASELA1          Load ALU's A1 register from AIN port 
83          ASELA2          Load ALU's A2 register from AIN port 
84          ASELA3          Load ALU's A3 register from AIN port 
85          ASELB0          Load ALU's B0 register from BIN port 
86          ASELB1          Load ALU's B1 register from BIN port 
87          ASELB2          Load ALU's B2 register from BIN port 
88          ASELB3          Load ALU's B3 register from BIN port 
89          BIM*/X          Direct xbar2 output or multiplier output to BIN port 
90 to 91    ARDA0,ARDA1     Selects A operand register for ALU operations 
92 to 93    ARDB0,ARDB1     Selects B operand register for ALU operations 
94          AMSWSEL         Select the msw/lsw of the ALU output 
95          AMRES           Reset ALU and the multiplier 
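
The bit assignments above describe a 96-bit microprogram word. As a rough 
illustration of how such a word could be assembled and inspected during hardware 
debugging, the sketch below packs field values into a Python integer. The grouped 
field names (SEQ, DM, XQW and so on) are hypothetical shorthands for the signal 
groups in the table, and the sketch assumes that bit 0 is the least significant 
bit of the word and that the lowest-numbered bit of a group is the least 
significant bit of its value; neither convention is stated in this appendix. 

```python
# (first bit, width) for each signal group, taken from the bit numbers in the table.
# Group names are illustrative shorthands, not names used by the cell software.
FIELDS = {
    "SEQ": (0, 7), "DBEN": (7, 1), "DATA": (8, 16), "AG": (24, 10),
    "DSEL": (34, 1), "CDXFER": (35, 1), "FS": (36, 3), "SETST": (39, 1),
    "DM": (40, 6), "XQW": (46, 4), "XQR": (50, 2), "YQW": (52, 4),
    "YQR": (56, 2), "X1S": (58, 2), "X2S": (60, 2), "CNTR": (62, 2),
    "MSEL": (64, 4), "MRD": (68, 2), "MSP": (70, 1), "MMSWSEL": (71, 1),
    "ALUI": (72, 9), "ASELA": (81, 4), "ASELB": (85, 4), "BIM": (89, 1),
    "ARDA": (90, 2), "ARDB": (92, 2), "AMSWSEL": (94, 1), "AMRES": (95, 1),
}


def pack(**values: int) -> int:
    """Build a microword from field values; unspecified fields are left at 0."""
    word = 0
    for name, value in values.items():
        first_bit, width = FIELDS[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
        word |= value << first_bit
    return word


def unpack(word: int, name: str) -> int:
    """Extract one field value from a microword."""
    first_bit, width = FIELDS[name]
    return (word >> first_bit) & ((1 << width) - 1)


if __name__ == "__main__":
    word = pack(DATA=0xBEEF, ALUI=0b001000011, AMRES=1)
    print(f"{word:024x}")              # 96 bits shown as 24 hex digits
    print(hex(unpack(word, "DATA")))   # 0xbeef
```

A real assembler would start from each field's default value rather than zero. 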


APPENDIX 3 

CELL INSTRUCTION SET 


Field #1    Program sequencer    Default cont 

cont      0000000    Continue 
jpcof     0010101    If flag jump PC 
jpcnf     0110101    If not flag jump PC 
jtwo      101cc01    If condition jump PC+2 
jda       111cc11    If condition jump data absolute 
jdr       111cc01    If condition jump data relative 
jdi       111cc10    If condition jump data indirect 
adrst     1001111    If sign of C1 jump data, C1<=P1, else C1< 
jrc       110cc11    If condition jump R1 
jrs       1101111    If sign of C1 jump R1, C1<=C1-1 
jsa       111cc00    If condition jump subroutine, absolute 
jsr       111cc01    If condition jump subroutine relative 
rtn       101cc11    If condition return from subroutine 
branch    100cc11    If sign of C1 jump R1, else, C1<=C1-1, 
                     if condition jump data 
psdss     0011110    Push data onto SS 
ppssd     0111110    Pop SS to data port 
wrssp     0001110    Write SSP 
rdssp     0101100    Read SSP 
dssp      0000010    Decrement SSP 
sgsp      0000111    Select GSP 
slsp      0000110    Select LSP 
rdrsp     0101111    Read RSP 
wrrsp     0001100    Write RSP 
pspc      0100011    Push PC to RS 
psgsp     0000101    Push GSP to SS 
psdrs     0011111    Push data onto RS 
pprsd     0111111    Pop RS to data port 
a1rsp     0101011    Add 1 to RSP 
s1rsp     0001111    Subtract 1 from RSP 
s4rsp     0111100    Subtract 4 from RSP 
rdsr      0101110    Read SR 
wrsr      0011100    Write SR 
pssr      0100001    Push SR onto SS 
ppsr      0100010    Pop SR from SS 
wrcntr    0111011    Write C1 
clrs      0010100    Clear sign bit 
sets      0110100    Set sign bit 
pscntr    0001011    Push C1 onto SS 
ppcntr    0011011    Pop C1 from SS 
dccntr    0110011    Decrement C1 
ifcdec    101cc00    If condition decrement C1 
ccir      0010001    Clear current interrupt 
cair      0000001    Clear all interrupts 
rtnir     0000011    Return from interrupt 
rdiv      0101101    Read interrupt vector and increment IVP 
wriv      0001101    Write interrupt vector and increment IVP 
irmbc     0010011    IR mask bitwise clear 
irmbs     0010010    IR mask bitwise set 
disir     0010110    Disable interrupts 
enair     0110110    Enable interrupts 
slir      0010111    Select latched interrupts 
stir      0110111    Select transparent interrupts 
slrivp    0011101    Write SLR<=D 5-2 and 
rel16     0100100    Select 16-bit relative addressing 
rel12     0100111    Select 12-bit relative addressing 
rel8      0100110    Select 8-bit relative addressing 
idle      0010000    Idle 
ihc       0100101    Enable instruction hold control 
wcs       0100000    Write control store 

Field #2    Address generator    Default nop 

nop     0000000000    No operation 
yinc    1011ccrrrr    Output and increment/initialize 
ydec    1010ccrrrr    Output and decrement/initialize 
yadd    11ccbb1rrr    Output and add offset/initialize 
ysub    11ccbb0rrr    Output and subtract offset/initialize 
yrtr    000101rrrr    Output and transfer R to R 
yrtb    0011bbrrrr    Output and transfer R to B 
yrtc    0010ccrrrr    Output and transfer R to C 
dti     00001111ii    Transfer D to I 
itr     1000iirrrr    Transfer I to R 
btr     0100bbrrrr    Transfer B to R 
rtd     000100rrrr    Transfer R to D 
ctd     00001100cc    Transfer C to D 
btd     00001101bb    Transfer B to D 
itd     00001110ii    Transfer I to D 
yor     0111bbrrrr    Output & OR B with/to R 
yand    0110bbrrrr    Output & AND B with/to R 
yxor    0101bbrrrr    Output & XOR B with/to R 
yasr    000111rrrr    Output & arithmetic shift right R to R 
ylsl    000110rrrr    Output & logical shift left R to R 
rst     0000000001    Reset control register 
dtcr    0000101110    Transfer from data port to control register 
crtd    0000101111    Transfer from control register to data port 
seti    0000100iix    Set conditional re-initialization on CMP flag 
setp    00001010pp    Set chip precision 
sety    000001001x    Set Y port to transparent/latched mode 
selr    000001101x    Select upper/lower address register bank 
selb    000001100x    Select upper/lower base register bank 
setu    000001011x    Set update mode (post/pre) 
seta    000001010x    Set/clear conditional AIR execute mode 
wra     0000101100    Write AIR with data bus 
rda     0000101101    Read AIR at data bus 
lda     0000011110    Load AIR from instruction port on next cycle 
ydty    0000011111    Pass data bus to Y port 
yrev    1001bbrrrr    Output address register in bit-reversed format 



Field #3    Address generator input data select    Default nodsel 

nodsel    0    Disables address generator's data port 
dsel      1    Load data from data port to the specified register 

Field #4    Cell data transfer    Default nocdxfer 

nocdxfer    0    Deselect cell data transfer latch 
cdxfer      1    Transfer data from sequencer to address generator 
                 or vice-versa 

Field #5    Flag input select    Default flag7 

flag7    111    Select SFLAG 
flag0    000    Select HFLAG 
flag1    001    (undefined) 
flag2    010    (undefined) 
flag3    011    (undefined) 
flag4    100    (undefined) 
flag5    101    (undefined) 
flag6    110    Select address generator zero flag 

Field #6    Set status    Default set 

set      1    Debug LED OFF 
noset    0    Debug LED ON 

Field #7    Data memory    Default dmnu 

dmnu     111111    Data-memory is not in use 
hrw      011111    Sets data-memory for host read/writes 
dmtxb    100111    Transfer from data-memory to crossbar, i.e., cell 
                   read 
atdm     101000    Write ALU output to data-memory 
mtdm     101010    Write multiplier output to data-memory 

Field #8    X queue write    Default xqnu 

xqnu        1111    X queue is not in use (for write operation) 
atxq(Wx)    0ww0    Write ALU output to X queue location Wx 
mtxq(Wx)    0ww1    Write multiplier output to X queue location Wx 
                    (Wx = w0, w1, w2 or w3) 

Field #9    X queue read    Default xqtxb(r0) 

xqtxb(Rx)   rr      Read from X queue location rr and transfer it to 
                    crossbar (Rx = r0, r1, r2 or r3) 

Field #10    Y queue write    Default yqnu 

yqnu        1111    Y queue is not in use (for write operation) 
atyq(Wx)    0ww0    Write ALU output to Y queue location Wx 
mtyq(Wx)    0ww1    Write multiplier output to Y queue location Wx 
                    (Wx = w0, w1, w2 or w3) 

Field #11    Y queue read    Default yqtxb(r0) 

yqtxb(Rx)   rr      Read from Y queue location rr and transfer it to 
                    crossbar (Rx = r0, r1, r2 or r3) 

Field #12    Crossbar 1    Default xqtxb1 

xqtxb1    01    Select X queue as the output of crossbar 1 
yqtxb1    10    Select Y queue as the output of crossbar 1 
dmtxb1    11    Select data-memory as the output of crossbar 1 
litxb1    00    Select literal "0" as the output of crossbar 1 

Field #13    Crossbar 2    Default xqtxb2 

xqtxb2    01    Select X queue as the output of crossbar 2 
yqtxb2    10    Select Y queue as the output of crossbar 2 
dmtxb2    11    Select data-memory as the output of crossbar 2 
litxb2    00    Select literal "1" as the output of crossbar 2 

Field #14    Clock counter    Default discntr 

discntr    11    Disable clock-cycle-counter 
enbcntr    01    Enable clock-cycle-counter 
clrcntr    10    Reset clock-cycle-counter to zero 

Field #15    Multiplier register loading instructions    Default lnmr 

lnmr       0000    Load no multiplier register(s) 
lmra0      0001    Load multiplier register A0 
lmra1      0010    Load multiplier register A1 
lmrb0      0100    Load multiplier register B0 
lmrb1      1000    Load multiplier register B1 
lmra0b0    0101    Load multiplier register A0 and B0 
lmra0b1    1001    Load multiplier register A0 and B1 
lmra1b0    0110    Load multiplier register A1 and B0 
lmra1b1    1010    Load multiplier register A1 and B1 

Field #16    Multiplier instructions    Default fxpm[b1a1] 

fxpm[b1a1]    000    Multiply [B1] and [A1] (A1 and B1 contain 
                     twos-complement integers) 
fxpm[b1a0]    001    Multiply [B1] and [A0] (A0 and B1 contain 
                     twos-complement integers) 
fxpm[b0a1]    010    Multiply [B0] and [A1] (A1 and B0 contain 
                     twos-complement integers) 
fxpm[b0a0]    011    Multiply [B0] and [A0] (A0 and B0 contain 
                     twos-complement integers) 
flpm[b1a1]    100    Multiply [B1] and [A1] (A1 and B1 contain 32-bit 
                     single-precision floating-point numbers) 
flpm[b1a0]    101    Multiply [B1] and [A0] (A0 and B1 contain 32-bit 
                     single-precision floating-point numbers) 
flpm[b0a1]    110    Multiply [B0] and [A1] (A1 and B0 contain 32-bit 
                     single-precision floating-point numbers) 
flpm[b0a0]    111    Multiply [B0] and [A0] (A0 and B0 contain 32-bit 
                     single-precision floating-point numbers) 

Field #17    Multiplier output    Default smmsw 

smmsw    1    Output most-significant-word of the product 
smlsw    0    Output least-significant-word of the product 

Field #18    ALU instructions    Default nop 

nop        000000000    No operation 
iadd       001000011    Add A and B 
isubb      001001011    Subtract B from A 
isuba      001000111    Subtract A from B 
iaddwc     001010011    Add A and B with carry 
isubwbb    001011011    Subtract B from A with borrow 
isubwba    001010111    Subtract A from B with borrow 
inega      001000101    Negate A 
inegb      001001010    Negate B 
iaddas     001100011    Absolute value of A plus B 
isubbas    001101011    Absolute value of A minus B 
isubaas    001100111    Absolute value of B minus A 
compla     000000101    Complement A 
complb     000001010    Complement B 
passa      000000001    Pass A unmodified 
passb      000000010    Pass B unmodified 
aandb      000010010    Bitwise logical AND of A and B 
aorb       000100010    Bitwise logical OR of A and B 
axorb      000110010    Bitwise logical XOR of A and B 
clr        100000000    Clear all status flags 
sadd       111000011    Add A and B 
ssubb      111000111    Subtract B from A 
ssuba      111001011    Subtract A from B 
scomp      111001111    Compare A to B, i.e., A-B 
saddas     011000011    Absolute value of A plus B 
ssubbas    011000111    Absolute value of A minus B 
ssubaas    011001011    Absolute value of B minus A 
sfixa      011001101    Convert SP fl-pt A to twos-complement integer 
sfixb      011001110    Convert SP fl-pt B to twos-complement integer 
sfloata    011100101    Convert twos-complement integer A to SP fl-pt 
sfloatb    011100110    Convert twos-complement integer B to SP fl-pt 
spassa     011110001    Pass A unmodified 
spassb     011110010    Pass B unmodified 

Field #19    ALU Ax registers loading instructions    Default lnaar 

lnaar    0000    Deselect all A registers of the ALU 
laa0r    0001    Load ALU register A0 
laa1r    0010    Load ALU register A1 
laa2r    0100    Load ALU register A2 
laa3r    1000    Load ALU register A3 

Field #20    ALU Bx registers loading instructions    Default lnabr 

lnabr     00000    Deselect all B registers of the ALU 
lab0rx    10001    Load ALU B0 register from the o/p of crossbar 2 
lab1rx    10010    Load ALU B1 register from the o/p of crossbar 2 
lab2rx    10100    Load ALU B2 register from the o/p of crossbar 2 
lab3rx    11000    Load ALU B3 register from the o/p of crossbar 2 
lab0rm    00001    Load ALU B0 register from the o/p of multiplier 
lab1rm    00010    Load ALU B1 register from the o/p of multiplier 
lab2rm    00100    Load ALU B2 register from the o/p of multiplier 
lab3rm    01000    Load ALU B3 register from the o/p of multiplier 

Field #21    ALU operand registers    Default aor[b0a0] 

aor[b0a0]    1010    Use register B0 and A0 for source operands 
aor[b2a2]    0000    Use register B2 and A2 for source operands 
aor[b2a3]    0001    Use register B2 and A3 for source operands 
aor[b2a0]    0010    Use register B2 and A0 for source operands 
aor[b2a1]    0011    Use register B2 and A1 for source operands 
aor[b3a2]    0100    Use register B3 and A2 for source operands 
aor[b3a3]    0101    Use register B3 and A3 for source operands 
aor[b3a0]    0110    Use register B3 and A0 for source operands 
aor[b3a1]    0111    Use register B3 and A1 for source operands 
aor[b0a2]    1000    Use register B0 and A2 for source operands 
aor[b0a3]    1001    Use register B0 and A3 for source operands 
aor[b0a1]    1011    Use register B0 and A1 for source operands 
aor[b1a2]    1100    Use register B1 and A2 for source operands 
aor[b1a3]    1101    Use register B1 and A3 for source operands 
aor[b1a0]    1110    Use register B1 and A0 for source operands 
aor[b1a1]    1111    Use register B1 and A1 for source operands 
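
The Field #21 codes follow a regular pattern: the upper two bits select the B 
register and the lower two bits select the A register, each with its high bit 
inverted (00 selects register 2, 01 selects 3, 10 selects 0, 11 selects 1). The 
sketch below is an observation drawn from the table rows, not a description of 
the underlying hardware, and the function name is hypothetical. 

```python
from typing import Tuple


def decode_alu_operands(code: int) -> Tuple[str, str]:
    """Return the (B register, A register) pair named by a 4-bit Field #21 code."""
    if not 0 <= code <= 0b1111:
        raise ValueError("Field #21 is a 4-bit code")
    b_index = ((code >> 2) & 0b11) ^ 0b10   # upper two bits, high bit inverted
    a_index = (code & 0b11) ^ 0b10          # lower two bits, high bit inverted
    return f"B{b_index}", f"A{a_index}"


if __name__ == "__main__":
    for code in (0b1010, 0b0000, 0b0111, 0b1101):
        print(f"{code:04b}", decode_alu_operands(code))
```

For example, the default code 1010 decodes to B0 and A0, matching the first row 
of the table. 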

Field #22    ALU output    Default samsw 

samsw    1    Output most-significant-word of the ALU result 
salsw    0    Output least-significant-word of the ALU result 

Field #23    ALU/multiplier reset    Default amopr 

amopr    0    ALU and multiplier are operational 
amres    1    Reset ALU and multiplier 



APPENDIX 4 


ANALOG DEVICES 


Word-Slice™ 
Program Sequencer 


FEATURES 

16-Bit Microcode Addressing Capability 

Look-Ahead™ Pipeline 

Extensive Interrupt Processing, With Ten On-Chip 
Interrupt Vectors 

70ns Cycle Time, 25ns Clock-to-Address Delay 

64-Word RAM for Storing 
Subroutine Linkage 
Jump Addresses 
Counters 
Status Register 

375mW Maximum Power Dissipation with 
CMOS Technology 

48-Pin DIP 



WORD-SLICE™ MICROCODED SYSTEM WITH ADSP-1401 


GENERAL DESCRIPTION 

The ADSP-1401 is a high-speed microprogram controller optimized 
for the demanding sequencing tasks found in digital 
signal processors and general purpose computers. In addition to 
high speed (25ns clock-to-address delay) and large addressing 
range (64K of program memory), this Word-Slice component 
has unique features that make it highly versatile: 

• on-chip storage and control of ten prioritized and 
maskable interrupts 

• four decrementing event counters 

• absolute, relative and indirect addressing capability 

• download capability (writeable control store) and 

• a dynamically configurable 64-word RAM 

The ADSP-1401 microprogram sequencer's main task is to 
provide the appropriate microprogram addressing to support 
programming requirements (e.g., looping, jumping, branching, 
subroutines, condition testing and interrupts). An internal 
Look-Ahead pipeline, controlled by both phases of the clock, allows 
the ADSP-1401 to satisfy these requirements at very high speed. 

During each micro-instruction, the ADSP-1401 monitors the 
conditions and instructions to determine the next microprogram 
address. This address can come from one of several sources: the 
stack, the jump address space in the RAM, the data port, the 
interrupt vectors, or the microprogram counter. An extensive 
set of conditional instructions is also available, including jumps, 
branches, subroutines, interrupts, and writeable control store. 


The ADSP-1401's internal 64-word RAM is user-configurable 
into three regions: subroutine stack, register stack and indirect 
jump address space. The subroutine stack is used for linking 
interrupts and subroutines and, during their execution, allows 
storage of system states. The register stack allows association of 
unique jump addresses with various levels of interrupts and 
subroutines (both local and global stacks are provided). Indirect 
jump capability is also supported, addressing for which is provided 
at the data port. 

Interrupts are handled entirely on chip. The ADSP-1401's internal 
interrupt control logic includes registers for eight external (user) 
interrupt vectors, a mask register, and a priority decoder. Two 
additional vectors are reserved for internally-generated interrupts 
resulting from counter underflow and stack limit violation. A 
stack limit violation is caused by stack overflow, underflow or 
collision. A mechanism is provided for recovering from stack 
violations. 

The ADSP-1401's four decrementing 16-bit counters are used to 
track loops and events. These counters generate a signal when 
negative. This negative condition is used by several conditional 
instructions and can also trigger an internal interrupt. 






[Figure: ADSP-1401 Block Diagram, with inputs labelled EXTERNAL INTERRUPTS (EXIR), INSTRUCTION and FLAG] 

























ADDRESSING MODES 

Direct: both absolute and relative 
Indirect: from internal RAM 


HARDWARE FEATURES 
Instruction Port 
Bidirectional Data Port 
Four Input Address Multiplexer 
Three Stack Pointers 
Four Event Counters 
Condition Flag 

Eight Prioritized and Maskable User Interrupts 
TTR Pin 
Trap 

Three-State 

Reset 


INSTRUCTION TYPES 
Jumps and Branches 
Stack Operations 
Status Register Operations 
Counter Operations 
Interrupt Control 
Relative Address Width Controls 
Instruction Hold Control 
Writeable Control Store 
Dedicated Counter Underflow Interrupt 
Dedicated Stack Overflow Interrupt 


ADSP-1401 PIN ASSIGNMENTS 


Pin Name    Description 

I6-I0       The 7-bit microinstruction controlling the 
            ADSP-1401 
Y15-Y0      Output bus which provides addresses to the 
            microprogram memory 
D15-D0      Bidirectional data bus for transferring data to or 
            from the ADSP-1401 
EXIR4-1     Four external interrupt request lines. Note that 
            internal circuitry supports 8 interrupts with the aid of 
            an external 2-to-1 multiplexer 
CLK         External clock input 
FLAG        An input used for conditional instructions. Its 
            source is usually a condition multiplexer 
TTR         A multi-purpose pin accommodating traps, output 
            disable and reset 
VDD         +5 Volt supply 
GND         Ground 




ANALOG DEVICES 


Address Generator 



FEATURES 

16-Bit Addresses with Higher Precision Options 
High Speed, Clock-to-Valid-Address Delay of 20ns 
Look-Ahead™ Pipeline 
Versatile Addressing Hardware 
30 16-Bit Registers 

16-Bit ALU with Left/Right Shift & Carry I/O 
Comparator 
Bit Reverser 
Dual Ports 

Powerful Single-Cycle Looping Instructions 
175mW Maximum Power Dissipation with 
CMOS Technology 
48-Pin DIP 


GENERAL INFORMATION 

The ADSP-1410 is a fast, flexible address generator optimized 
for digital signal/array processors and other high performance 
computers. This low-power CMOS device rapidly generates the 
data memory addresses required by routines such as digital 
filters, FFTs, matrix operations, and DMAs. With its 16-bit 
architecture, registers, dual ports, and speed, the 48-pin 
ADSP-1410 improves performance and reduces board space substantially 
relative to bit-slice solutions. 

The ADSP-1410's architecture features a 16-bit ALU, a 
comparator, and 30 16-bit registers. The registers are organized into 
four files: sixteen address (R) registers, six offset (B) registers, 
four compare (C) registers, and four initialization (I) registers. 

The ADSP-1410 rapidly executes key address generating 
operations. In a single instruction cycle, the device can: 

• output a 16-bit memory address, 

• modify this memory address, and, 

• detect when the address value has moved to or beyond a 
pre-set boundary and conditionally loop back to the 
top of a circular buffer 

Consequently, circular buffers and modulo addressing for data 
memories can be implemented without overhead. 

The ADSP-1410's 10-bit microcode instructions include 
commands for looping, register read/writes, internal data transfers, 
and logical/shift operations. Instructions are normally supplied 
via an external source. However, an internal Alternate 
Instruction Register (AIR) can provide the instruction under external 
control, allowing microcode to be conserved in many 
applications. 


The ADSP-1410 has a 16-bit address (Y) port for outputting 
addresses and a 16-bit data (D) port for I/O between internal 
and external registers. Also, an internal path allows external 
data, provided via the D port, to serve as an ALU source and/or 
to be directly output over the Y port for a DMA capability. 

Double-precision (30-bit), single-cycle addressing can be 
performed by cascading two ADSP-1410s, with the MSB of each 
chip's D and Y port dedicated to interchip communication. 
Alternatively, a single AG can provide double-precision addresses 
at a rate of one per two clock cycles. 

The Look-Ahead™ pipeline eliminates the need for an external 
microcode pipeline register by internally latching instructions 
and addresses; the microcode bus may be directly routed to the 
ADSP-1410 from microcode memory. Logically, the 
Look-Ahead™ pipeline is split into two halves: the first, located at 
the instruction (and data) port, and the second, located at the 
address port. Each half of the pipeline (input vs. output) has a 
transparent latch which operates out of phase with the other: 
the address latch is transparent during the first half of the cycle 
(clock HI), while the input latches (instruction and data) are 
transparent during the second half of the cycle (clock LO). This 
complementary arrangement allows new instructions to be decoded 
(in preparation for the following cycle) while the program address 
for the current cycle is held steady. 



ADSP-1410 OVERVIEW 

Digital Signal Processing (DSP) and array processing systems 
require fast, flexible address generation circuitry. An Address 
Generator (AG) supplies the address of a location in data or 
coefficient memory. The value residing at the specified address 
is fetched and fed to an arithmetic unit for processing. The AG 
must then modify the address pointer in anticipation of the next 
data fetch. For algorithms that repetitively loop through data 
buffers, the AG may need to compare the address to a buffer 
end and conditionally loop back to the top of the buffer. Finally, 
to maximize throughput, an AG must perform its addressing 
tasks rapidly and without overhead. 

With the ADSP-1410, 16-bit pointers to memory are stored in 
an address (R) register file. Since an AG must track several 
pointers concurrently, sixteen R registers, denoted $R_n$, are 
provided. If we denote Y as the address port, the operation 
"$Y \leftarrow R_n$" corresponds to the AG supplying an address from 
register $R_n$. 

After supplying an address, the AG must update the pointer for 
the next memory fetch. The updating may be as simple as an 
increment but, more generally, involves adding or subtracting 
an arbitrary offset value. Also, algorithms generally access several 
different offset values. To this end, the AG provides six offset 

ADDRESS SOURCES 

- Sixteen internal R registers 

- External data provided over the D port 

OFFSET SOURCES 

- Six internal B registers 

- Data Port 

OFFSET OPERATIONS 

- Increment ($R_n \leftarrow R_n + 1$) 

- Decrement ($R_n \leftarrow R_n - 1$) 

- Add Offset ($R_n \leftarrow R_n + B_m$) 

- Subtract Offset ($R_n \leftarrow R_n - B_m$) 

- Single-Bit Left/Right Shifts 

- Logical Operations (AND, OR, XOR) 

CONDITIONAL RE-INITIALIZATION 

- Independent Inhibit/Enable for each of four 
initialization registers 

- Conditional AIR execution (used for true 
modulo addressing) 

OUTPUT/UPDATE SEQUENCE 

- Normal (Pre-Update) Mode (output the address 
before update) 

- Post-Update Mode (output the address after 
update) 

PRECISION 

- Single chip supplies 16-bit addresses 

- Two chips cascaded provide 30-bit addresses 

- One chip provides 30-bit addresses in two 
cycles 


registers, denoted $B_m$, and can execute in a single cycle the core 
operation: 

$Y \leftarrow R_n,\; R_n \leftarrow R_n + B_m$ 

In DSP applications, data arrays are often addressed as circular 
buffers. That is, when addressing reaches the buffer end, it 
wraps back to the beginning of the buffer. To implement this 
looping, the AG compares the supplied address to one of four 
compare registers, denoted $C_i$. If the address has moved to or 
beyond the end of the boundary ($R_n \ge C_i$), the device can 
transfer an initialization register value, denoted $I_i$, to the register 
($R_n \leftarrow I_i$); otherwise, it is updated in normal fashion 
($R_n \leftarrow R_n + B_m$). To minimize overhead, the AG can execute 
normal updates while also performing conditional re-initializations, 
again, in one core operation: 

IF ($R_n \ge C_i$) $R_n \leftarrow I_i$, else $R_n \leftarrow R_n + B_m$ 

Since the above instruction handles the looping required of 
circular buffer addressing, it is termed a looping instruction. To 
a large extent, the ADSP-1410's architecture and instruction set 
revolve around efficient implementation of this instruction. 
However, many variations of this instruction are supported on 
the device and spelled out in the following sections. 
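
The behaviour of the looping instruction can be illustrated with a short software 
model. This is a minimal sketch, assuming a positive offset, 16-bit wrap-around of 
the address register, and that the address is output before it is updated; the 
function and parameter names are hypothetical and do not correspond to any 
ADSP-1410 tool. 

```python
def looping_addresses(start: int, offset: int, compare: int, init: int, count: int) -> list:
    """Model `count` executions of: output Rn; if Rn >= Ci then Rn <- Ii else Rn <- Rn + Bm."""
    r = start                            # address register Rn
    out = []
    for _ in range(count):
        out.append(r)                    # Y <- Rn (address supplied to memory)
        if r >= compare:                 # supplied address reached the buffer end Ci
            r = init                     # Rn <- Ii (wrap to the top of the buffer)
        else:
            r = (r + offset) & 0xFFFF    # Rn <- Rn + Bm, kept to 16 bits
    return out


if __name__ == "__main__":
    # Four-word circular buffer at 0x100..0x103, stepped one word at a time.
    addrs = looping_addresses(start=0x100, offset=1, compare=0x103, init=0x100, count=6)
    print([hex(a) for a in addrs])
    # ['0x100', '0x101', '0x102', '0x103', '0x100', '0x101']
```

The model makes the point of the text explicit: the comparison and the conditional 
re-initialization cost no extra step beyond the address output itself. 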


ADSP-1410 PIN ASSIGNMENTS 

PIN NAME      DESCRIPTION 

Y15-Y0        The address (Y) output port. In single-chip/double- 
              precision mode, the MSB (Y15) indicates whether 
              the supplied address is the MSW or LSW (see 
              Precision Modes). In two-chip/double-precision 
              mode, the MSB conveys the carry/shift bit from 
              the Least Significant (LS) to the Most Significant 
              (MS) chip. 

D15-D0        The bi-directional data (D) port. In two-chip/double- 
              precision addressing mode, the MSB (D15) of 
              this port conveys CMP status from the partner 
              chip. 

I9-I0         The instruction port. 

CMP/Z         A dual function pin. Looping instructions, which 
              compare address register values to compare 
              register values, assert this pin HI to convey 
              CMP status if i) R≥C for positive offsets, or 
              ii) R≤C for negative offsets. Logical/shift 
              instructions assert this pin HI to convey the ZERO 
              status of the result. 

DSEL          Data Select control. Asserting this control HI 
              causes data set up on the data port to substitute 
              for the R value specified in the instruction. 

AIR Enable    Alternate Instruction Register control. Asserting 
              this control HI causes the device to execute an 
              instruction stored in the internal AIR, rather 
              than the instruction set up on the instruction 
              port. 

CLK           Clock 

VDD           +5 Volt Power Supply 

GND           Ground 



































ANALOG 

DEVICES 


High-Speed 64-Bit IEEE Floating Point 

Multiplier and ALU 


PRELIMINARY TECHNICAL DATA 

FEATURES 

Implements a Full Floating Point Processor Solution, 
Handling 32-Bit and 64-Bit Floating Point Numbers 
Fully Compatible with IEEE Standard 754 
Three Data Formats 
32-Bit Single Precision Floating Point 
64-Bit Double Precision Floating Point 
32-Bit Fixed Point 
Fast 

Single Precision Throughput of 100ns/ 
10 MEGAFLOPS for All Operations 
Double Precision Throughput of 
100ns/10 MEGAFLOPS (ADSP-3220) 
400ns/2.5 MEGAFLOPS (ADSP-3210) 

32-Bit Fixed Point Throughput of 100ns/10MHz 
for All Operations 

Flexible I/O Structures Support Full Data Transfer 
Rate 

ADSP-3220 Supports Three- and Two-Port 
Structures 

ADSP-3210 Supports Two-Port Structures 
One Internal Pipeline Stage in Each Part 
Multiple Input Registers Associated with Each 
Input Port 

400mW Max Power Dissipation Per Chip with CMOS 
Technology 

Fully Registered Inputs, Outputs, and Control Signals 
Three-State Outputs with Separate Enables 
100-Lead Pin Grid Array (ADSP-3210) 

144-Lead Pin Grid Array Package (ADSP-3220) 

GENERAL DESCRIPTION 

The ADSP-3210 Floating Point Multiplier and ADSP-3220 
Floating Point ALU are high-speed, low-power arithmetic 
processors with data format conforming to IEEE Standard 754. 
The ADSP-3210 and ADSP-3220 comprise the basic elements to 
implement a high-speed numerical processor with operations on 
three data formats: 32-bit IEEE single-precision floating point, 
64-bit IEEE double-precision floating point, and 32-bit two's 
complement fixed point. 

Fabricated in CMOS, the ADSP-3210 and ADSP-3220 provide 
a high-speed (100ns cycle time at 70°C) and low power (less 
than 400mW power dissipation per chip) floating point processor 
solution. The processors offer very high throughput for all three 
data formats: 32-bit IEEE Single-Precision results produced 
every 100ns, 64-bit IEEE Double-Precision results every 100ns 
(ADSP-3220) and every 400ns (ADSP-3210), and fixed point 
results every 100ns. 


The chips' data formats and floating point operations conform 
to the proposed IEEE Standard 754, Draft 10.0, assuring complete 
software portability for computational algorithms adhering to 
the Standard. The chips support all four rounding modes in the 
Standard for all three data formats. All four exception conditions 
detected (overflow, underflow, invalid operation, and inexact 
result) are provided as dedicated status pins, minimizing response 
time of the external system to exceptions. 

The ADSP-3210 and ADSP-3220 have a powerful instruction 
set designed for a systems-level implementation of function 
calculations. Specific instructions are included to facilitate such 
functions as table lookup, floating point divide and square root, 
quadrant normalization for trig functions, and operations on 
denormals. 

The ADSP-3210 Block Diagram shows the Floating Point 
Multiplier's two-port structure: one 32-bit input port and one 32-bit 
output port. Two 32-bit registers are available for each of the A 
and B operands. Data inputs and outputs transfer at twice the 
cycle rate, performing two 32-bit input operations and supplying 
an entire 64-bit or 32-bit output product on every cycle. All 
inputs and outputs are registered. 

The ADSP-3220 Block Diagram shows the three-port structure 
of the Floating Point ALU: two 32-bit input ports and one 32-bit 
output port. The ADSP-3220 can be configured to have one 
or two input ports. Four input registers are available for each of 
the A and B operands; each input port can load any of the eight 
input registers. 



The ADSP-3220's three-port structure accommodates the full 10 
MEGAFLOPS throughput rate for Double Precision (DP) operations, 
loading two 64-bit operands on each cycle and operating 
internally with 64-bit-wide data paths. The ADSP-3210's two-port 
structure accommodates the full DP throughput rate (4 cycles/2.5 
MEGAFLOPS), performing all four DP cross-products in 4 
cycles with its 32 x 32 multiplier array. 
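
The four cross-products mentioned above can be illustrated with a short sketch. It assumes each 53-bit double-precision significand is split into a low 32-bit word and a high 21-bit word; the function name and the order in which the partial products are accumulated are illustrative assumptions, not the chip's documented internal sequencing.

```python
MASK32 = (1 << 32) - 1


def dp_significand_product(a: int, b: int) -> int:
    """Multiply two 53-bit significands using four 32 x 32 partial products."""
    assert 0 <= a < (1 << 53) and 0 <= b < (1 << 53)
    a_lo, a_hi = a & MASK32, a >> 32  # the high word holds the top 21 bits
    b_lo, b_hi = b & MASK32, b >> 32
    # One cross-product per pass through a single 32 x 32 array (assumed order).
    partials = [
        (a_lo * b_lo, 0),
        (a_lo * b_hi, 32),
        (a_hi * b_lo, 32),
        (a_hi * b_hi, 64),
    ]
    # Each partial product is weighted by the position of its input words.
    return sum(p << shift for p, shift in partials)
```

Each of the four multiplications takes operands no wider than 32 bits, which is why a single 32 x 32 array needs four passes for one double-precision product.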

Fixed point addition or subtraction accepts two 32-bit two's 
complement operands and produces a 32-bit two's complement 
result every 100ns. Fixed point multiplication produces a 64-bit 
two's complement product every 100ns. The 64-bit product on 
the ADSP-3210 may be shifted left by one bit on output to 
eliminate a redundant sign bit in the most significant word of 
the product. 

The ADSP-3210 and ADSP-3220 support the gradual underflow 
provisions of the IEEE standard. A FAST mode is included on 
each chip, which sets results less than the IEEE normalized 
format to zero. FAST mode simplifies underflow exception 
handling while retaining all the other benefits of the high dynamic 
range and precision in the IEEE Floating Point format. 
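
The difference between gradual underflow and FAST mode can be sketched in software for the single-precision case. This is a behavioural illustration only: the helper names are hypothetical, and returning positive zero for a flushed result is an assumption (the text says only that the result is set to zero).

```python
import struct

MIN_NORMAL_SP = 2.0 ** -126  # smallest normalized single-precision magnitude


def to_single(x: float) -> float:
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def sp_multiply(a: float, b: float, fast: bool) -> float:
    """Single-precision multiply with optional FAST-style flush to zero."""
    # The product of two single-precision values is exact in double precision,
    # so only one rounding (to single) takes place here.
    result = to_single(to_single(a) * to_single(b))
    if fast and result != 0.0 and abs(result) < MIN_NORMAL_SP:
        return 0.0  # FAST mode: the underflowed result is replaced by zero
    return result   # IEEE mode: the denormalized result is kept
```

With `fast=False`, a product such as 2^-100 x 2^-30 is returned as the denormalized value 2^-130; with `fast=True` the same product comes back as zero.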


ADSP-3210 FLOATING POINT MULTIPLIER 

[Block diagram. Legible signal labels: register controls, DIN, 
MSWSEL, DOUT, status, SHLP, RND, ABS, DP, SP, WRAP, FAST] 

PIN DEFINITIONS 

(Positive Logic Naming Convention is Used Throughout) 

Unless otherwise noted, pin definitions apply to both the 
ADSP-3210 and ADSP-3220. 

SP denotes 32-bit Single-Precision floating point 
DP denotes 64-bit Double-Precision floating point 
FIXED denotes 32-bit fixed point format 
x denotes the Input Register Number 

Data Lines 

DIN31 - DIN0 (ADSP-3210 only) 32 Data Input pins. 
DIN31 is the most significant bit (MSB). 

AIN31 - AIN0 (ADSP-3220 only) 32 Data Input pins. 
AIN31 is the MSB. 

BIN31 - BIN0 (ADSP-3220 only) 32 Data Input pins. 
BIN31 is the MSB. 

DOUT31 - DOUT0 32 Data Output pins. DOUT31 
is the MSB. 

Asynchronous Input Control Lines 

IPORT1, IPORT0 (3220 only) Controls which select the port 
configuration, source input port(s), and destination input 
registers. These controls should be hardwired to the desired 
port configuration. 


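ADSP-3220 

Signal    Action for Two-Input        Action for One-Input 
          Port Configuration*         Port Configuration* 

SELA0     Load A0 on [clock edge]     Load A0 on [clock edge] 
SELA1     Load A1 on [clock edge]     Load A1 on [clock edge] 
SELA2     Load A2 on [clock edge]     Load A2 on [clock edge] 
SELA3     Load A3 on [clock edge]     Load A3 on [clock edge] 

SELB0     Load B0 on [clock edge]     Load B0 on [clock edge] 
SELB1     Load B1 on [clock edge]     Load B1 on [clock edge] 
SELB2     Load B2 on [clock edge]     Load B2 on [clock edge] 
SELB3     Load B3 on [clock edge]     Load B3 on [clock edge] 

[The clock-edge symbols that followed "on" are illegible in this copy.] 

NOTE 
*(ADSP-3220 only) IPORT1, IPORT0 control one-input or two-input port 
configurations. 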


RDA1, RDA0, RDB1, RDB0 Input Operand Select controls 
selecting the operands for the operation. 


ADSP-3210 

Signal    State    Register Selected 

RDA0      0        A1 
          1        A0 

RDB0      0        B1 
          1        B0 




IPORT1  IPORT0  Port Configuration  Source and Destination 

0       0       Two Port            AIN31-0 -> B Registers 
                                    BIN31-0 -> A Registers 
1       0       One Port            AIN31-0 source for all registers 
0       1       One Port            BIN31-0 source for all registers 
1       1       Two Port            AIN31-0 -> A Registers 
                                    BIN31-0 -> B Registers 
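
The port configuration table can be expressed as a small lookup, which may help when checking how the IPORT pins should be hardwired. This is a sketch only: the dictionary layout and function name are hypothetical, and the bit pairs are keyed in the column order printed in the table (IPORT1 first, then IPORT0).

```python
# (IPORT1, IPORT0) -> (configuration, {input port: destination register bank})
PORT_CONFIG = {
    (0, 0): ("two-port", {"AIN": "B", "BIN": "A"}),
    (1, 0): ("one-port", {"AIN": "A and B"}),
    (0, 1): ("one-port", {"BIN": "A and B"}),
    (1, 1): ("two-port", {"AIN": "A", "BIN": "B"}),
}


def port_configuration(iport1: int, iport0: int):
    """Return the configuration name and the source-to-register routing."""
    return PORT_CONFIG[(iport1, iport0)]
```

For example, `port_configuration(1, 1)` gives the straight-through two-port arrangement in which AIN feeds the A registers and BIN feeds the B registers.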


Registered Input Control Lines 

All registered input controls are latched on the rising edge of 
the clock. All controls are clocked with the intermediate results 
at the internal pipeline register, keeping the control lines associated 
with the proper operands. 

CLK Clock Input. The rising edge of CLK latches the controls 
for an operation, initiates the operation, and clocks input 
data into the selected register(s). The falling edge of CLK 
only clocks input data into the selected register(s). 

SELAx, SELBx Input Register Select controls for loading the 
input registers with the input port data. 


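ADSP-3210 

Signal    Action 

SELA0     Load A0 on [clock edge] 
SELA1     Load A1 on [clock edge] 
SELB0     Load B0 on [clock edge] 
SELB1     Load B1 on [clock edge] 

[The clock-edge symbols that followed "on" are illegible in this copy.] 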



(ADSP-3210 only) For double precision operations, both 
RDA0 and RDB0 must be HIGH at the start of the operation. 
After initiation of the DP operation, the ADSP-3210 controls 
the DP multiplication, and RDA0 and RDB0 are ignored 
until completion. 


ADSP-3220 

Signals       State    Register Selected 

RDA1 RDA0     0 0      A2 
              0 1      A3 
              1 0      A0 
              1 1      A1 

RDB1 RDB0     0 0      B2 
              0 1      B3 
              1 0      B0 
              1 1      B1 


(ADSP-3220 only) For double precision operations, only 
RDA1 and RDB1 select the DP operand pairs, and RDA0 
and RDB0 must be HIGH. 
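
The operand select encodings above are easy to misread because the state 0 0 does not select register 0. A minimal sketch of both decode tables follows; the function names are hypothetical and the mappings are copied from the tables as printed.

```python
# ADSP-3220: (RDx1, RDx0) -> register index, per the printed table.
_SELECT_3220 = {(0, 0): 2, (0, 1): 3, (1, 0): 0, (1, 1): 1}


def select_registers_3220(rda1: int, rda0: int, rdb1: int, rdb0: int):
    """Return the names of the A and B registers chosen on the ADSP-3220."""
    a = _SELECT_3220[(rda1, rda0)]
    b = _SELECT_3220[(rdb1, rdb0)]
    return f"A{a}", f"B{b}"


def select_registers_3210(rda0: int, rdb0: int):
    """Return the names of the A and B registers chosen on the ADSP-3210."""
    # On the ADSP-3210 a HIGH select line picks register 0, LOW picks register 1.
    return ("A0" if rda0 else "A1", "B0" if rdb0 else "B1")
```

Under this reading, the double-precision requirement that RDA0 and RDB0 be HIGH leaves RDA1 and RDB1 to choose between the A3 or A1 and B3 or B1 entries of the ADSP-3220 table.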

ABSA, ABSB Absolute Value controls. ABSA causes the chip 
to convert the selected A operand to its absolute value before 
performing the operation; ABSB causes the B operand to 
take on its absolute value. On the ADSP-3220, Absolute 
Value is available for all SP, DP, and FIXED operands. On 
the ADSP-3210, Absolute Value is only available for SP and 
DP operands. 

RND1, RND0 Rounding mode controls. See Method of 
Operation for explanation of rounding. The round control 
modes are: 

RND0  RND1    Rounding Mode 

0     0       Round to Nearest Number 
0     1       Round to Plus Infinity 
1     0       Round to Zero 
1     1       Round to Minus Infinity 
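
The four rounding modes differ only in which neighbouring representable value is chosen. The sketch below illustrates this by rounding an exact rational value to an integer rather than to a floating point format, which is a simplification. The control-bit pairs are keyed in the column order printed in the table, the mode names are hypothetical, and "nearest" is assumed to break ties toward the even neighbour as in IEEE 754.

```python
import math
from fractions import Fraction

# Control-bit pair -> rounding mode, in the column order printed in the table.
ROUNDING_MODES = {
    (0, 0): "nearest",
    (0, 1): "plus_infinity",
    (1, 0): "zero",
    (1, 1): "minus_infinity",
}


def round_to_integer(value: Fraction, mode: str) -> int:
    """Round an exact value to an integer under one of the four modes."""
    if mode == "nearest":
        return round(value)       # ties go to the even neighbour
    if mode == "plus_infinity":
        return math.ceil(value)
    if mode == "zero":
        return math.trunc(value)
    if mode == "minus_infinity":
        return math.floor(value)
    raise ValueError(f"unknown rounding mode: {mode}")
```

Rounding -5/2 shows the distinction: round to zero and round to plus infinity both give -2, while round to minus infinity gives -3.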



SP (ADSP-3210 only) Single Precision mode (active HIGH). 
Selects 32-bit Single Precision format for both operands and 
product. 

DP (ADSP-3210 only) Double Precision mode (active HIGH). 
Selects 64-bit Double Precision format for both operands and 
product. 

(ADSP-3210 only) If neither SP nor DP is HIGH on the 
rising edge of CLK, the 3210 will operate in FIXED mode, 
multiplying two 32-bit Fixed Point two's complement operands 
and producing a 64-bit two's complement product. Asserting 
both SP and DP is an illegal state, causing an indeterminate 
output. 

I8 - I0 (ADSP-3220 only) Instruction control lines. For the 
ADSP-3220 instruction set, see Method of Operation section. Selection 
of SP, DP, or FIXED data formats is explicit in the ADSP-3220 
Instruction word. 

FAST Fast mode pin. When FAST is active (HIGH), an 
underflow will return a result of all zeroes for SP and DP 
operations. FAST has no effect on FIXED mode operations. If 
FAST is inactive (LOW), then the chips produce IEEE-compatible 
outputs for denormalized and underflowed results. 

(ADSP-3210 only) A floating point underflow result with 
FAST LOW (in IEEE mode) will return a "wrapped number": 
correct fraction and sign, with exponent a two's complement 
negative number.¹ With FAST HIGH, denormalized inputs 
are forced to zero for the operation. 

(ADSP-3220 only) FAST LOW generates proper denormalized 
outputs for underflowed results. With FAST HIGH, 
denormalized and underflowed results are forced to zero, but 
denormal inputs are not modified before performing the operation. 
FAST HIGH also forces to zero the denormal and underflowed 
results of DP to SP conversions and comparison operations. 

WRAPA, WRAPB (ADSP-3210 only) Control pins that label 
the selected A or B register as a denormalized ("wrapped") 
number. The multiplier then treats the input's exponent as a 
two's complement negative number.¹ 

SHLP (ADSP-3210 only) Control pin to shift left a 64-bit 
Fixed Point product. When HIGH, SHLP shifts the 64-bit 
output register left by one bit on output (eliminating a 
redundant sign bit in the two's complement product), and 
shifts a zero into the LSB. SHLP has no shift effect on floating 
point outputs. SHLP is clocked with the operands at each 
level of the pipeline; therefore a change in the state of SHLP 
will take effect on the output register one cycle later. 
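
The redundant sign bit arises because the product of two 32-bit two's complement numbers needs at most 63 significant bits in most cases, so the top two bits of the 64-bit product are usually copies of each other. A minimal sketch of the fixed point multiply and the SHLP shift follows. The function name is hypothetical, and reading the operands as fractions with 31 fractional bits is an assumption made for the example, not something the text states.

```python
MASK64 = (1 << 64) - 1


def fixed_multiply(a: int, b: int, shlp: bool) -> int:
    """Return the 64-bit output register as an unsigned bit pattern."""
    assert -(1 << 31) <= a < (1 << 31) and -(1 << 31) <= b < (1 << 31)
    product = (a * b) & MASK64  # 64-bit two's complement product
    if shlp:
        # Shift left by one bit; a zero enters the LSB, the old MSB is dropped.
        product = (product << 1) & MASK64
    return product
```

With both operands equal to 0x40000000 (0.5 under the fractional reading), the upper 32 bits of the shifted product are 0x20000000, which is 0.25 in the same format; without the shift the upper word is 0x10000000.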

Non-Registered (Unclocked) Control Lines 

MSWSEL When true (HIGH), MSWSEL selects the most 
significant 32 bits of the output register on DOUT31-0. When 
MSWSEL is LOW, the least significant 32 bits are selected 
for output. MSWSEL takes effect on the output register 
immediately (it is not latched). For SP and Logical operations, 
the proper output is selected with MSWSEL HIGH. 

OEN Output data enable. OEN HIGH enables data on 
DOUT31-0; OEN LOW causes the data output pins to be in 
a high-impedance state. 

RESET Reset control pin. When RESET goes active (LOW), 
the chip is reset internally. RESET clears all control functions 
in the chip, sets all status flags to zero, but does not clear 
the input registers. RESET should be activated on power up, 
ensuring proper initialization of the chip. 

Status Outputs 

All status outputs are active HIGH. All status outputs except 
DENORM are paired with their operands through the pipeline, 
so they become true when the corresponding result is clocked 
into the output register. For specific conditions causing each 
exception, see Exceptions and Status Outputs section. 

INEXO Inexact Result status output, generated when the 
result could not be expressed exactly in the destination format 
without loss of accuracy. 

OVRFLO Overflow status output, generated when the result of 
a SP or DP operation is greater than the maximum representable 
number in the destination format before rounding. 

UNDFLO Underflow status output, generated when the result 
of a SP or DP operation is less than the minimum representable 
number in the destination format. When underflow occurs, 
the result produced at the output depends on the state of 
FAST (see FAST Pin Description). 

INVALOP Invalid Operation status output, generated when an 
invalid operation as specified in the IEEE Standard 754 
occurred (e.g., 0 x ∞). 

DENORM (ADSP-3210 only) Denormal status output. This 
signal is active (HIGH) when a denormalized number is 
detected as an input operand, and informs the system that 
the input must be wrapped by the ADSP-3220 if the multiplier 
is to handle it properly.¹ DENORM becomes valid on the 
clock cycle after the operand(s) are loaded into the multiplier 
array. The multiplier will set the denormal to zero and 
complete the multiplication. A denormal input causes the 
same status flags to appear as a zero input would. 

RNDCARO (ADSP-3210 only) Round Propagated status 
output. This signal can only occur when the UNDFLO 
status output on the ADSP-3210 also becomes true. 
RNDCARO indicates that a carry bit propagated across the 
fraction's rounding boundary when the fraction was rounded 
to the destination format. RNDCARO is used in conjunction 
with INEXO to enable the ADSP-3220 to unwrap a wrapped 
number correctly.¹ 

Registered Status Inputs (ADSP-3220 only) 

RNDCARI Round Propagated status input. Same as 
RNDCARO on the ADSP-3210, except that on the ADSP-3220 
this is an input. The controlling system provides the RNDCARI 
input to the ADSP-3220 when performing the UNWRAP 
function.¹ 

INEXIN Inexact Input. Same as INEXO on the ADSP-3210, 
except that on the ADSP-3220 this is an input. The controlling 
system provides the INEXIN input to the ADSP-3220 when 
performing the UNWRAP function.¹ 

Other Signals 

VDD +5 volt power supply 
GND Ground 


¹See Application Note on Handling IEEE Exceptions 




FEATURES 

• First-In/First-Out dual-port memory 

• 1024 x 9 organization 

• Low power consumption 

• Ultra high speed: 35ns cycle time (28.5MHz) 

• Asynchronous and simultaneous read and write 

• Fully expandable by both word depth and/or bit width 

• Pin compatible with Mostek MK4501 but with Half-Full Flag 
capability 

• Allows for deep word structure (1024) without expansion 

• Half-Full Flag capability in single device mode 

• Master/Slave multiprocessing applications 

• Bidirectional and rate buffer applications 

• Empty and Full warning flags 

• Auto retransmit capability 

• High performance CEMOS technology 

• Available in Plastic DIP, CERDIP, 300 mil sidebraze THINDIP, 
LCC, PLCC and Flatpack 

• Military product compliant to MIL-STD-883, Class B 


DESCRIPTION 

The IDT7202A is a dual-port memory that utilizes a special First- 
In/First-Out algorithm that loads and empties data on a first-in/first- 
out basis. The device uses Full and Empty flags to prevent data 
overflow and underflow, and expansion logic to allow for unlimited 
expansion capability in both word size and depth. 

The reads and writes are internally sequential through the use of 
ring pointers, with no address information required to load and 
unload data. Data is toggled in and out of the device through the use 
of the Write (W) and Read (R) pins. The device has a read/write 
cycle time of 35ns (28.5MHz). 

The device utilizes a 9-bit wide data array to allow for control and 
parity bits at the user's option. This feature is especially useful in 
data communications applications where it is necessary to use a 
parity bit for transmission/reception error checking. It also features 
a Retransmit (RT) capability that allows for reset of the read pointer 
to its initial position when RT is pulsed low to allow for 
retransmission from the beginning of data. A Half-Full Flag is available 
in the single device mode and width expansion modes. 

The IDT7202A is fabricated using IDT's high-speed CEMOS 
technology. It is designed for those applications requiring 
asynchronous and simultaneous read/writes in multiprocessing and 
rate buffer applications. The 1024 x 9 organization of the IDT7202A 
allows a 1024 deep word structure without the need for expansion. 
Military grade product is manufactured in compliance with the 
latest revision of MIL-STD-883, Class B. 
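
The ring-pointer behaviour described above (sequential reads and writes with no external addresses, Empty and Full flags, and Retransmit) can be modelled in a few lines. This is a behavioural sketch only: the class and method names are hypothetical, the Half-Full threshold of "more than half the locations occupied" is an assumption, and no timing is modelled.

```python
class Fifo:
    """Behavioural model of a 1024 x 9 first-in/first-out memory."""

    DEPTH = 1024
    WORD_MASK = 0x1FF  # 9-bit words

    def __init__(self):
        self.mem = [0] * self.DEPTH
        self.reset()

    def reset(self):
        self.wr = 0       # write ring pointer
        self.rd = 0       # read ring pointer
        self.count = 0    # words currently held
        self.written = 0  # total writes since reset

    @property
    def empty(self):
        return self.count == 0

    @property
    def full(self):
        return self.count == self.DEPTH

    @property
    def half_full(self):
        return self.count > self.DEPTH // 2  # assumed threshold

    def write(self, word):
        if self.full:
            return False  # Full flag blocks the write (no overflow)
        self.mem[self.wr] = word & self.WORD_MASK
        self.wr = (self.wr + 1) % self.DEPTH
        self.count += 1
        self.written += 1
        return True

    def read(self):
        if self.empty:
            return None   # Empty flag blocks the read (no underflow)
        word = self.mem[self.rd]
        self.rd = (self.rd + 1) % self.DEPTH
        self.count -= 1
        return word

    def retransmit(self):
        # Only meaningful while the write pointer has not wrapped past
        # the start of the data; otherwise early words are overwritten.
        if self.written > self.DEPTH:
            raise RuntimeError("data at initial position already overwritten")
        self.rd = 0
        self.count = self.written
```

After writing three words and reading two of them, a call to `retransmit()` makes the next read return the first word again, which is the replay behaviour the RT pin provides.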


PIN CONFIGURATIONS 
FUNCTIONAL DESCRIPTION 

The ADSP-3128 Multiport Register File consists of a high-speed 
static RAM (configurable as either 128 x 16 or 64 x 32) surrounded 
by the latches and control logic needed for simple system 
interfacing (see Figure 1). Six internal data paths [width illegible] 
connect this RAM with multiplexers and latches; three 
are read data paths, three are write data paths. Three 7-bit 
internal address paths connect this RAM with muxes and address 
latches. These three address paths are time-multiplexed to allow 
the presentation of six addresses to the RAM per cycle. Hence, 
up to a total of six reads from and writes to the RAM are possible 
per cycle. Because of the abundance of data paths, many of 
which can transfer data twice per cycle, many combinations of 
six reads and writes are possible. 

Three addresses are presented to RAM in clock HI from the 
Aadr, Badr, and Eadr address latches. Normally in clock HI, 
these are RAM write addresses. They are prioritized in case of 
conflict. Three addresses are presented to RAM in clock LO 
from the Cadr, Dadr, and Eadr address latches. Normally in 
clock LO, these are RAM read addresses. Three simultaneous 
reads from the same RAM location are possible for normal, 
clock LO reads. The Eadr Port feeds both a write (clock HI) 
address latch and a read (clock LO) address latch, which can be 
independently set to latched or transparent modes. 
ADSP-3128 Multiport Register File Functional Block Diagram 
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AScBBt A&B& 

Eiah Ewft Dcftcnption 

X X Disable chip (consistcni wiih pipelines) but advance pipelines with clock cycle 

X X Register u rue data at A & B or Edata input latches on falling edge 

X X Holdmostrccenidaiaai A & Bor Edata input latches for thenexi cycle 

X X Latch s^nie data at A A Bor Edata input latches at clock HI 

X X Make transparent A A Bor Edata input latches 

0 X Allow write to RAM from the A, B, and Edata input latches 

1 X Inhibit write to Ra^M from the A, B> and Edata input latches 

0 0 Normal write to RAM from the A or Bor Edata input latches during clock HI 

X 1 Flow-through wntc(transparcnt) to RAM from the A or Bor Edata input 

latches dunngcI(Kk I O when wriic/rcad addresses arc equal 
X X Early Input to A & Bor I Alata input latches register I SW on fallingcdgc to 

inpLt latches and latch MSW to input latches in clock HI 
X 0 Late Input to A A B or Edata input latches latch I SW to input latches in clock 

HI and make input laichcstransparcntforMSW in clock HI 
X 1 Undefined ^ ^ ^ 

X X Hold most rct.cni daiaat A A Bor l^ate input latches for the next cycle 

X X Edata Slow Input register LSW to EdaU input latch on next falling edge (Lhi onl>) 

X X Edata Slow Inputircghl^MS^^to Fdaia input latch on next falling edge (Eht only) 

, Tab/e I A DSPS 1 28 SumPbaryo fdaiaJn%utVnd Wnte C6n trof Modes 



Description 

       Disable chip (consistent with pipelines) but advance pipelines with clock cycle 
       Drive data from output latches through C or D or Edata-Port 
       Three-state (high impedance) output C or D or Edata-Port 
       Register data from RAM to C or D or Edata output latches on rising edge 
       C or D or Edata output latches are transparent in clock LO, latched in clock HI 
       C or D or Edata RAM output latches are fully transparent for both phases. 
       If Awinh/Bwinh/Ewinh = 1, an additional read(s) can be performed at 
       C/D/Edata, respectively 
       Edata-Port is configured for one read, one write, or two reads per cycle 
       Edata-Port is configured for both a read and a write every cycle; read Eadr 
       is registered on falling edge and write Eadr is registered on rising edge 
X 0    Configured for Late Read at C or D or Edata-Port: register LSW & MSW from RAM 
       to output latches on rising edge, output LSW in clock HI, output MSW on next 
       clock LO 
X 0    Configured for Early Read at C or D or Edata-Port: output LSW from RAM through 
       transparent output latches in clock LO, latch MSW to output latches and output 
       in clock HI 
X 1    Configured for Edata Slow Read: hold RAM read data at Edata output latch, output 
       LSW at clock HI 
X 1    Configured for Edata Slow Read: hold RAM read data at Edata output latch, output 
       MSW at clock HI 
X X    Undefined 

[Column headings and the control-column entries for the upper rows are illegible in this copy.] 

Table II. ADSP-3128 Summary of Data Read and Output Control Modes 
Description 

Disable chip (consistent with pipelines) but advance pipelines with clock cycle 
Latch A or B or Eadr write addresses at clock HI* 
A or B or Eadr write address latches are transparent* 
Register C or D or Eadr read address latches on the rising edge* 
C or D or Eadr read address latches are transparent* 
Disable A/B/C/D/Edata-Port 
Enable A/B/C/D/Edata-Port 

*Eio overrides Wadtrn and Radtrn at the Eadr Port only. That is, when Eio = 1, Eadr write addresses are 
registered on rising edges and read addresses are registered on falling edges regardless of the state of Wadtrn and 
Radtrn. The other four address ports are unaffected by Eio and always behave as described in this table. 

ADSP-3128 Summary of Address Control Modes 



